2025-05-07T20:23:26.1393090Z Current runner version: '2.323.0'
2025-05-07T20:23:26.1398843Z Runner name: 'i-061cb0426579ace80'
2025-05-07T20:23:26.1399775Z Machine name: 'ip-10-0-29-91'
2025-05-07T20:23:26.1402490Z ##[group]GITHUB_TOKEN Permissions
2025-05-07T20:23:26.1404928Z Contents: read
2025-05-07T20:23:26.1405441Z Metadata: read
2025-05-07T20:23:26.1405919Z Packages: read
2025-05-07T20:23:26.1406410Z ##[endgroup]
2025-05-07T20:23:26.1408571Z Secret source: None
2025-05-07T20:23:26.1409212Z Prepare workflow directory
2025-05-07T20:23:26.2313191Z Prepare all required actions
2025-05-07T20:23:26.2352745Z Getting action download info
2025-05-07T20:23:26.4733888Z Download action repository 'actions/checkout@v4' (SHA:11bd71901bbe5b1630ceea73d27597364c9af683)
2025-05-07T20:23:26.7676259Z Download action repository 'actions/download-artifact@v4' (SHA:d3f86a106a0bac45b974a628896c90dbdf5c8093)
2025-05-07T20:23:27.1265939Z Download action repository 'pytorch/test-infra@main' (SHA:117fccdf5892ff9a958d2afb4b4b8b6e930d3187)
2025-05-07T20:23:28.7842033Z Getting action download info
2025-05-07T20:23:29.0872873Z Download action repository 'nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482' (SHA:3e91a01664abd3c5cd539100d10d33b9c5b68482)
2025-05-07T20:23:29.3205637Z Complete job name: test_and_publish_artifact (x86, linux.g5.4xlarge.nvidia.gpu, genai, 3.13, 12.8.0, 12.6.3, clang)
2025-05-07T20:23:29.3700753Z A job started hook has been configured by the self-hosted runner administrator
2025-05-07T20:23:29.3813101Z ##[group]Run '/home/ec2-user/runner-scripts/before_job.sh'
2025-05-07T20:23:29.3824393Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:29.3825026Z ##[endgroup]
2025-05-07T20:23:30.5860307Z Runner Type: linux.g5.4xlarge.nvidia.gpu
2025-05-07T20:23:30.5860727Z Instance Type: g5.4xlarge
2025-05-07T20:23:30.5860969Z AMI Name: unknown
2025-05-07T20:23:30.5898217Z AMI ID: ami-071226ecf16aa7d96
2025-05-07T20:23:35.9472564Z ##[group]Run actions/checkout@v4
2025-05-07T20:23:35.9472863Z with:
2025-05-07T20:23:35.9473082Z   submodules: true
2025-05-07T20:23:35.9473313Z   repository: pytorch/FBGEMM
2025-05-07T20:23:35.9473688Z   token: ***
2025-05-07T20:23:35.9473888Z   ssh-strict: true
2025-05-07T20:23:35.9474085Z   ssh-user: git
2025-05-07T20:23:35.9474303Z   persist-credentials: true
2025-05-07T20:23:35.9474541Z   clean: true
2025-05-07T20:23:35.9474767Z   sparse-checkout-cone-mode: true
2025-05-07T20:23:35.9475024Z   fetch-depth: 1
2025-05-07T20:23:35.9475230Z   fetch-tags: false
2025-05-07T20:23:35.9475442Z   show-progress: true
2025-05-07T20:23:35.9475653Z   lfs: false
2025-05-07T20:23:35.9475859Z   set-safe-directory: true
2025-05-07T20:23:35.9476090Z env:
2025-05-07T20:23:35.9476292Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:35.9476571Z   BUILD_ENV: build_binary
2025-05-07T20:23:35.9476807Z   BUILD_TARGET: genai
2025-05-07T20:23:35.9477036Z   BUILD_VARIANT: cuda
2025-05-07T20:23:35.9477294Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:35.9477538Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:35.9477766Z ##[endgroup]
2025-05-07T20:23:36.0638303Z Syncing repository: pytorch/FBGEMM
2025-05-07T20:23:36.0639566Z ##[group]Getting Git version info
2025-05-07T20:23:36.0639989Z Working directory is '/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM'
2025-05-07T20:23:36.0641066Z [command]/usr/bin/git version
2025-05-07T20:23:36.0641500Z git version 2.47.1
2025-05-07T20:23:36.0661436Z ##[endgroup]
2025-05-07T20:23:36.0676021Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/e68163b3-6f02-4d62-afcd-52334a42cb06' before making global git config changes
2025-05-07T20:23:36.0677121Z Adding repository directory to the temporary git global config as a safe directory
2025-05-07T20:23:36.0690657Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:36.0730093Z [command]/usr/bin/git config --local --get remote.origin.url
2025-05-07T20:23:36.0753412Z https://github.com/pytorch/FBGEMM
2025-05-07T20:23:36.0771258Z ##[group]Removing previously created refs, to avoid conflicts
2025-05-07T20:23:36.0776099Z [command]/usr/bin/git rev-parse --symbolic-full-name --verify --quiet HEAD
2025-05-07T20:23:36.0801871Z refs/heads/main
2025-05-07T20:23:36.0811771Z [command]/usr/bin/git checkout --detach
2025-05-07T20:23:36.9428485Z HEAD is now at b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
2025-05-07T20:23:36.9479771Z [command]/usr/bin/git branch --delete --force main
2025-05-07T20:23:36.9507483Z Deleted branch main (was b6b2ce3).
2025-05-07T20:23:36.9514645Z ##[endgroup]
2025-05-07T20:23:36.9517637Z [command]/usr/bin/git submodule status
2025-05-07T20:23:36.9941483Z e5d7c0bd5d9aec44d68830187138149e6a8c4e32 external/asmjit (e5d7c0b)
2025-05-07T20:23:37.0028272Z 4a61bdd4bd4ed730e078aebc7c0fcf046ff29406 external/composable_kernel (4a61bdd)
2025-05-07T20:23:37.0114658Z 6543fec09b2f04ac4a666882998b534afc9c1349 external/cpuinfo (6543fec)
2025-05-07T20:23:37.0199386Z 3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3 external/cutlass (3ed8d2e)
2025-05-07T20:23:37.0283483Z f8d7d77c06936315286eb55f8de22cd23c188571 external/googletest (f8d7d77)
2025-05-07T20:23:37.0369628Z 420084499c7c1e1c2d801922f40df202eac5f3a0 external/hipify_torch (4200844)
2025-05-07T20:23:37.0452151Z 9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03 external/json (9cca280)
2025-05-07T20:23:37.0465139Z ##[group]Cleaning the repository
2025-05-07T20:23:37.0469689Z [command]/usr/bin/git clean -ffdx
2025-05-07T20:23:37.0525837Z [command]/usr/bin/git reset --hard HEAD
2025-05-07T20:23:37.0631639Z HEAD is now at b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
2025-05-07T20:23:37.0638990Z ##[endgroup]
2025-05-07T20:23:37.0641268Z ##[group]Disabling automatic garbage collection
2025-05-07T20:23:37.0645669Z [command]/usr/bin/git config --local gc.auto 0
2025-05-07T20:23:37.0676801Z ##[endgroup]
2025-05-07T20:23:37.0677400Z ##[group]Setting up auth
2025-05-07T20:23:37.0682835Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2025-05-07T20:23:37.0724812Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2025-05-07T20:23:37.1055540Z Entering 'external/asmjit'
2025-05-07T20:23:37.1122068Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.1195102Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.1262514Z Entering 'external/cutlass'
2025-05-07T20:23:37.1336212Z Entering 'external/googletest'
2025-05-07T20:23:37.1402067Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.1468286Z Entering 'external/json'
2025-05-07T20:23:37.1552620Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2025-05-07T20:23:37.1584992Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2025-05-07T20:23:37.1912719Z Entering 'external/asmjit'
2025-05-07T20:23:37.1979101Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.2052990Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.2118974Z Entering 'external/cutlass'
2025-05-07T20:23:37.2193415Z Entering 'external/googletest'
2025-05-07T20:23:37.2260439Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.2326578Z Entering 'external/json'
2025-05-07T20:23:37.2415170Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:23:37.2465937Z ##[endgroup]
2025-05-07T20:23:37.2466424Z ##[group]Fetching the repository
2025-05-07T20:23:37.2473295Z [command]/usr/bin/git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=1 origin +a2f4c52051596e74bc8c16e3d2867a4ecdd271e0:refs/remotes/pull/4066/merge
2025-05-07T20:23:37.4642356Z From https://github.com/pytorch/FBGEMM
2025-05-07T20:23:37.4643148Z  * [new ref] a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 -> pull/4066/merge
2025-05-07T20:23:37.4668518Z ##[endgroup]
2025-05-07T20:23:37.4669044Z ##[group]Determining the checkout info
2025-05-07T20:23:37.4670359Z ##[endgroup]
2025-05-07T20:23:37.4674214Z [command]/usr/bin/git sparse-checkout disable
2025-05-07T20:23:37.4724238Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig
2025-05-07T20:23:37.4753968Z ##[group]Checking out the ref
2025-05-07T20:23:37.4757390Z [command]/usr/bin/git checkout --progress --force refs/remotes/pull/4066/merge
2025-05-07T20:23:37.4879730Z Previous HEAD position was b6b2ce3 Migrate TBE forward kernels to `FBGEMM_LAUNCH_KERNEL` (#4079)
2025-05-07T20:23:37.4882997Z HEAD is now at a2f4c52 Merge 6060cd4b5f971680caecdcc657faccb5720d1c3e into fd4df5f456e0cca514bacd98a39efb72990fd9f4
2025-05-07T20:23:37.4892660Z ##[endgroup]
2025-05-07T20:23:37.4893249Z ##[group]Setting up auth for fetching submodules
2025-05-07T20:23:37.4897907Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-05-07T20:23:37.4944940Z [command]/usr/bin/git config --global --unset-all url.https://github.com/.insteadOf
2025-05-07T20:23:37.4976416Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com:
2025-05-07T20:23:37.5007743Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com:
2025-05-07T20:23:37.5035805Z ##[endgroup]
2025-05-07T20:23:37.5036306Z ##[group]Fetching submodules
2025-05-07T20:23:37.5039123Z [command]/usr/bin/git submodule sync
2025-05-07T20:23:37.5412980Z Synchronizing submodule url for 'external/asmjit'
2025-05-07T20:23:37.5413501Z Synchronizing submodule url for 'external/composable_kernel'
2025-05-07T20:23:37.5414206Z Synchronizing submodule url for 'external/cpuinfo'
2025-05-07T20:23:37.5414853Z Synchronizing submodule url for 'external/cutlass'
2025-05-07T20:23:37.5416599Z Synchronizing submodule url for 'external/googletest'
2025-05-07T20:23:37.5417198Z Synchronizing submodule url for 'external/hipify_torch'
2025-05-07T20:23:37.5417676Z Synchronizing submodule url for 'external/json'
2025-05-07T20:23:37.5429053Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --depth=1
2025-05-07T20:23:37.5860347Z Submodule path 'external/asmjit': checked out 'e5d7c0bd5d9aec44d68830187138149e6a8c4e32'
2025-05-07T20:23:37.6008249Z Submodule path 'external/composable_kernel': checked out '4a61bdd4bd4ed730e078aebc7c0fcf046ff29406'
2025-05-07T20:23:37.6108602Z Submodule path 'external/cpuinfo': checked out '6543fec09b2f04ac4a666882998b534afc9c1349'
2025-05-07T20:23:37.6276279Z Submodule path 'external/cutlass': checked out '3ed8d2ec4ba35ef5d9d8353826209b6f868f63d3'
2025-05-07T20:23:37.6365916Z Submodule path 'external/googletest': checked out 'f8d7d77c06936315286eb55f8de22cd23c188571'
2025-05-07T20:23:37.6448121Z Submodule path 'external/hipify_torch': checked out '420084499c7c1e1c2d801922f40df202eac5f3a0'
2025-05-07T20:23:37.6549631Z Submodule path 'external/json': checked out '9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03'
2025-05-07T20:23:37.6566793Z [command]/usr/bin/git submodule foreach git config --local gc.auto 0
2025-05-07T20:23:37.6898521Z Entering 'external/asmjit'
2025-05-07T20:23:37.6931126Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.6962578Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.6995872Z Entering 'external/cutlass'
2025-05-07T20:23:37.7027967Z Entering 'external/googletest'
2025-05-07T20:23:37.7061044Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.7093035Z Entering 'external/json'
2025-05-07T20:23:37.7137812Z ##[endgroup]
2025-05-07T20:23:37.7138649Z ##[group]Persisting credentials for submodules
2025-05-07T20:23:37.7145939Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :"
2025-05-07T20:23:37.7474839Z Entering 'external/asmjit'
2025-05-07T20:23:37.7516736Z url.https://github.com/.insteadof
2025-05-07T20:23:37.7517473Z url.https://github.com/.insteadof
2025-05-07T20:23:37.7559646Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.7601641Z url.https://github.com/.insteadof
2025-05-07T20:23:37.7602028Z url.https://github.com/.insteadof
2025-05-07T20:23:37.7650414Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.7692391Z url.https://github.com/.insteadof
2025-05-07T20:23:37.7692922Z url.https://github.com/.insteadof
2025-05-07T20:23:37.7734065Z Entering 'external/cutlass'
2025-05-07T20:23:37.7776084Z url.https://github.com/.insteadof
2025-05-07T20:23:37.7776590Z url.https://github.com/.insteadof
2025-05-07T20:23:37.7826419Z Entering 'external/googletest'
2025-05-07T20:23:37.7868838Z url.https://github.com/.insteadof
2025-05-07T20:23:37.7869294Z url.https://github.com/.insteadof
2025-05-07T20:23:37.7910986Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.7970204Z url.https://github.com/.insteadof
2025-05-07T20:23:37.7970641Z url.https://github.com/.insteadof
2025-05-07T20:23:37.7995173Z Entering 'external/json'
2025-05-07T20:23:37.8037010Z url.https://github.com/.insteadof
2025-05-07T20:23:37.8037397Z url.https://github.com/.insteadof
2025-05-07T20:23:37.8098627Z [command]/usr/bin/git submodule foreach sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url"
2025-05-07T20:23:37.8430277Z Entering 'external/asmjit'
2025-05-07T20:23:37.8492611Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/asmjit/config remote.origin.url
2025-05-07T20:23:37.8495272Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.8556198Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/composable_kernel/config remote.origin.url
2025-05-07T20:23:37.8559234Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.8619909Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cpuinfo/config remote.origin.url
2025-05-07T20:23:37.8622854Z Entering 'external/cutlass'
2025-05-07T20:23:37.8684581Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/cutlass/config remote.origin.url
2025-05-07T20:23:37.8687837Z Entering 'external/googletest'
2025-05-07T20:23:37.8750054Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/googletest/config remote.origin.url
2025-05-07T20:23:37.8753103Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.8814120Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/hipify_torch/config remote.origin.url
2025-05-07T20:23:37.8816982Z Entering 'external/json'
2025-05-07T20:23:37.8878246Z file:/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/.git/modules/external/json/config remote.origin.url
2025-05-07T20:23:37.8993654Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:'
2025-05-07T20:23:37.9325257Z Entering 'external/asmjit'
2025-05-07T20:23:37.9358433Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.9389912Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.9421395Z Entering 'external/cutlass'
2025-05-07T20:23:37.9454188Z Entering 'external/googletest'
2025-05-07T20:23:37.9485316Z Entering 'external/hipify_torch'
2025-05-07T20:23:37.9516642Z Entering 'external/json'
2025-05-07T20:23:37.9564675Z [command]/usr/bin/git submodule foreach git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:'
2025-05-07T20:23:37.9892149Z Entering 'external/asmjit'
2025-05-07T20:23:37.9924847Z Entering 'external/composable_kernel'
2025-05-07T20:23:37.9956745Z Entering 'external/cpuinfo'
2025-05-07T20:23:37.9988287Z Entering 'external/cutlass'
2025-05-07T20:23:38.0021572Z Entering 'external/googletest'
2025-05-07T20:23:38.0056356Z Entering 'external/hipify_torch'
2025-05-07T20:23:38.0088407Z Entering 'external/json'
2025-05-07T20:23:38.0137860Z ##[endgroup]
2025-05-07T20:23:38.0179552Z [command]/usr/bin/git log -1 --format=%H
2025-05-07T20:23:38.0206742Z a2f4c52051596e74bc8c16e3d2867a4ecdd271e0
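A note on the auth dance above: the checkout action points SSH-style submodule URLs back at HTTPS with url.<base>.insteadOf, then injects the token via http.https://github.com/.extraheader. A minimal sketch of the rewrite mechanism, run in a throwaway HOME so no real config is touched (the ls-remote target is only an example):

    export HOME="$(mktemp -d)"   # isolate the global git config used below
    git config --global 'url.https://github.com/.insteadOf' 'git@github.com:'
    # git applies the rewrite before contacting the remote, so this SSH-style
    # URL is actually fetched over HTTPS:
    git ls-remote git@github.com:pytorch/FBGEMM HEAD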
2025-05-07T20:23:38.0379655Z ##[group]Run actions/download-artifact@v4
2025-05-07T20:23:38.0379960Z with:
2025-05-07T20:23:38.0380202Z   name: fbgemm_genai_x86_clang_py3.13_cu12.8.0.whl
2025-05-07T20:23:38.0380519Z   merge-multiple: false
2025-05-07T20:23:38.0380764Z   repository: pytorch/FBGEMM
2025-05-07T20:23:38.0381017Z   run-id: 14891846252
2025-05-07T20:23:38.0381225Z env:
2025-05-07T20:23:38.0381444Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:38.0381735Z   BUILD_ENV: build_binary
2025-05-07T20:23:38.0381977Z   BUILD_TARGET: genai
2025-05-07T20:23:38.0382192Z   BUILD_VARIANT: cuda
2025-05-07T20:23:38.0382430Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:38.0382671Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:38.0382900Z ##[endgroup]
2025-05-07T20:23:38.2673056Z Downloading single artifact
2025-05-07T20:23:38.3922073Z Preparing to download the following artifacts:
2025-05-07T20:23:38.3922879Z - fbgemm_genai_x86_clang_py3.13_cu12.8.0.whl (ID: 3081408483, Size: 18517235, Expected Digest: sha256:2c430e283306050771ed0148f8bc0ff9c88d696c9122c4b4956d4418e1e568bd)
2025-05-07T20:23:38.4520988Z Redirecting to blob download url: https://productionresultssa4.blob.core.windows.net/actions-results/b81c1ade-b872-4473-afc9-b227c140a38f/workflow-job-run-569411a8-0277-5d3b-912a-bdc2bb6543f6/artifacts/b2a3a6e3a6de2b82b0a644dc87c7372954cdbe64040c2d38887481b8860a6fb7.zip
2025-05-07T20:23:38.4522380Z Starting download of artifact to: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM
2025-05-07T20:23:38.5742479Z (node:58325) [DEP0005] DeprecationWarning: Buffer() is deprecated due to security and usability issues. Please use the Buffer.alloc(), Buffer.allocUnsafe(), or Buffer.from() methods instead.
2025-05-07T20:23:38.5743436Z (Use `node --trace-deprecation ...` to show where the warning was created)
2025-05-07T20:23:38.8551631Z SHA256 digest of downloaded artifact is 2c430e283306050771ed0148f8bc0ff9c88d696c9122c4b4956d4418e1e568bd
2025-05-07T20:23:38.8552188Z Artifact download completed successfully.
2025-05-07T20:23:38.8558237Z Total of 1 artifact(s) downloaded
2025-05-07T20:23:38.8558550Z Download artifact has finished successfully
2025-05-07T20:23:38.8802249Z ##[group]Run pytorch/test-infra/.github/actions/setup-nvidia@main
2025-05-07T20:23:38.8802630Z with:
2025-05-07T20:23:38.8802836Z   driver-version: 570.133.07
2025-05-07T20:23:38.8803071Z env:
2025-05-07T20:23:38.8803278Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:38.8803568Z   BUILD_ENV: build_binary
2025-05-07T20:23:38.8803809Z   BUILD_TARGET: genai
2025-05-07T20:23:38.8804027Z   BUILD_VARIANT: cuda
2025-05-07T20:23:38.8804252Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:38.8804502Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:38.8804732Z ##[endgroup]
2025-05-07T20:23:38.8900311Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
2025-05-07T20:23:38.8900680Z with:
2025-05-07T20:23:38.8900874Z   timeout_minutes: 10
2025-05-07T20:23:38.8901100Z   max_attempts: 3
2025-05-07T20:23:38.8923905Z   command:
    # Is it disgusting to have a full shell script here in this github action? Sure
    # But is it the best way to make it so that this action relies on nothing else? Absolutely
    set -eou pipefail

    DISTRIBUTION=$(. /etc/os-release;echo $ID$VERSION_ID)
    DRIVER_FN="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run"

    install_nvidia_docker2_amzn2() {
      (
        set -x
        # Needed for yum-config-manager
        sudo yum install -y yum-utils
        if [[ "${DISTRIBUTION}" == "amzn2023" ]] ; then
          YUM_REPO_URL="https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo"
        else
          # Amazon Linux 2
          YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo"
        fi
        sudo yum-config-manager --add-repo "${YUM_REPO_URL}"
        sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
        sudo systemctl restart docker
      )
    }

    install_nvidia_docker2_ubuntu20() {
      (
        set -x
        # Install the nvidia-docker2 package if it is not already installed
        status="$(dpkg-query -W --showformat='${db:Status-Status}' nvidia-docker2 2>&1)"
        if [ ! $? = 0 ] || [ ! "$status" = installed ]; then
          sudo apt-get install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
          sudo systemctl restart docker
        fi
      )
    }

    pre_install_nvidia_driver_amzn2() {
      (
        # Purge any nvidia driver installed from RHEL repo
        sudo yum remove -y nvidia-driver-latest-dkms
      )
    }

    install_nvidia_driver_common() {
      (
        # Try to gather more information about the runner and its existing NVIDIA driver if any
        echo "Before installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true

        HAS_NVIDIA_DRIVER=0
        # Check if the NVIDIA driver has already been installed
        if [ -x "$(command -v nvidia-smi)" ]; then
          set +e
          # The driver exists; check its version next. Also check only the first
          # GPU if there is more than one, so that the same driver version is not
          # printed over multiple lines
          INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
          NVIDIA_SMI_STATUS=$?
          if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
            echo "Failed to get NVIDIA driver version ($INSTALLED_DRIVER_VERSION). Continuing"
          elif [ "$INSTALLED_DRIVER_VERSION" != "$DRIVER_VERSION" ]; then
            echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has been installed, but we expect to have $DRIVER_VERSION instead. Continuing"
            # Turn off persistent mode so that the installation script can unload the kernel module
            sudo killall nvidia-persistenced || true
          else
            HAS_NVIDIA_DRIVER=1
            echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has already been installed. Skipping NVIDIA driver installation"
          fi
          set -e
        fi

        if [ "$HAS_NVIDIA_DRIVER" -eq 0 ]; then
          # CAUTION: this may need to be updated in the future
          if [ "${DISTRIBUTION}" != ubuntu20.04 ]; then
            sudo yum groupinstall -y "Development Tools"
            # ensure our kernel install is the same as our underlying kernel,
            # groupinstall "Development Tools" has a habit of mismatching kernel headers
            sudo yum install -y "kernel-devel-uname-r == $(uname -r)"
            sudo modprobe backlight
          fi
          sudo curl -fsL -o /tmp/nvidia_driver "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN"

          set +e
          sudo /bin/bash /tmp/nvidia_driver -s --no-drm
          NVIDIA_INSTALLATION_STATUS=$?

          RESET_GPU=0
          if [ "$NVIDIA_INSTALLATION_STATUS" -ne 0 ]; then
            sudo cat /var/log/nvidia-installer.log
            # Failed to install the NVIDIA driver; try to reset the GPU
            RESET_GPU=1
          elif [ -x "$(command -v nvidia-smi)" ]; then
            # Check again that nvidia-smi works even if the driver installation completes successfully
            INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0)
            NVIDIA_SMI_STATUS=$?
            if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then
              RESET_GPU=1
            fi
          fi

          if [ "$RESET_GPU" -eq 1 ]; then
            NVIDIA_DEVICES=$(lspci -D | grep -i NVIDIA | cut -d' ' -f1)
            # The GPU can get stuck in a failure state if somehow the test crashes the
            # GPU microcode. When this happens, we'll try to reset all NVIDIA devices
            # https://github.com/pytorch/pytorch/issues/88388
            for PCI_ID in $NVIDIA_DEVICES; do
              DEVICE_ENABLED=$(cat /sys/bus/pci/devices/$PCI_ID/enable)
              echo "Resetting $PCI_ID (enabled state: $DEVICE_ENABLED)"
              # This requires sudo permission of course
              echo "1" | sudo tee /sys/bus/pci/devices/$PCI_ID/reset
              sleep 1
            done
          fi

          sudo rm -fv /tmp/nvidia_driver
          set -e
        fi
      )
    }

    post_install_nvidia_driver_common() {
      (
        sudo modprobe nvidia || true
        echo "After installing NVIDIA driver"
        lspci
        lsmod
        modinfo nvidia || true
        (
          set +e
          nvidia-smi
          # NB: Annoyingly, the nvidia-smi command returns successfully with return code 0
          # even in the case where the driver has already crashed, as it still can get the
          # driver version and some basic information like the bus ID. However, the rest
          # of the information would be missing (ERR!), for example:
          #
          # +-----------------------------------------------------------------------------+
          # | NVIDIA-SMI 525.89.02    Driver Version: 525.89.02    CUDA Version: 12.0     |
          # |-------------------------------+----------------------+----------------------+
          # | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
          # | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
          # |                               |                      |               MIG M. |
          # |===============================+======================+======================|
          # |   0  ERR!                Off  | 00000000:00:1E.0 Off |                 ERR! |
          # |ERR!  ERR! ERR!    ERR! / ERR! |   4184MiB / 23028MiB |    ERR!      Default |
          # |                               |                      |                 ERR! |
          # +-------------------------------+----------------------+----------------------+
          #
          # +-----------------------------------------------------------------------------+
          # | Processes:                                                                  |
          # |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
          # |        ID   ID                                                   Usage      |
          # |=============================================================================|
          # +-----------------------------------------------------------------------------+
          #
          # This should be reported as a failure instead, as it is guaranteed to fail when
          # Docker tries to run with --gpus all
          #
          # So, the correct check here is to query one of the missing pieces of info, like
          # the GPU name, so that the command can fail accordingly
          nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
          NVIDIA_SMI_STATUS=$?

          # Allowable exit statuses for nvidia-smi, see: https://github.com/NVIDIA/gpu-operator/issues/285
          if [ "$NVIDIA_SMI_STATUS" -eq 0 ] || [ "$NVIDIA_SMI_STATUS" -eq 14 ]; then
            echo "INFO: Ignoring allowed status ${NVIDIA_SMI_STATUS}"
          else
            echo "ERROR: nvidia-smi exited with unresolved status ${NVIDIA_SMI_STATUS}"
            exit ${NVIDIA_SMI_STATUS}
          fi
          set -e
        )
      )
    }

    install_nvidia_driver_amzn2() {
      (
        set -x
        pre_install_nvidia_driver_amzn2
        install_nvidia_driver_common
        post_install_nvidia_driver_common
      )
    }

    install_nvidia_driver_ubuntu20() {
      (
        set -x
        install_nvidia_driver_common
        post_install_nvidia_driver_common
      )
    }

    echo "== Installing nvidia driver ${DRIVER_FN} =="
    case "${DISTRIBUTION}" in
      amzn*)
        install_nvidia_driver_amzn2
        ;;
      ubuntu20.04)
        install_nvidia_driver_ubuntu20
        ;;
      *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
    esac

    # Install container toolkit based on distribution
    echo "== Installing nvidia container toolkit for ${DISTRIBUTION} =="
    case "${DISTRIBUTION}" in
      amzn*)
        install_nvidia_docker2_amzn2
        ;;
      ubuntu20.04)
        install_nvidia_docker2_ubuntu20
        ;;
      *)
        echo "ERROR: Unknown distribution ${DISTRIBUTION}"
        exit 1
        ;;
    esac

    echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"

    # Fix https://github.com/NVIDIA/nvidia-docker/issues/1648 on runners with
    # more than one GPU. This just needs to be run once. The command fails
    # on subsequent runs and complains that the mode is already on, but that's
    # ok
    sudo nvidia-persistenced || true

    # This should show persistence mode ON
    nvidia-smi
2025-05-07T20:23:38.8947221Z   retry_wait_seconds: 10
2025-05-07T20:23:38.8947468Z   polling_interval_seconds: 1
2025-05-07T20:23:38.8947764Z   warning_on_retry: true
2025-05-07T20:23:38.8948003Z   continue_on_error: false
2025-05-07T20:23:38.8948237Z env:
2025-05-07T20:23:38.8948449Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:38.8948741Z   BUILD_ENV: build_binary
2025-05-07T20:23:38.8948990Z   BUILD_TARGET: genai
2025-05-07T20:23:38.8949206Z   BUILD_VARIANT: cuda
2025-05-07T20:23:38.8949435Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:38.8949677Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:38.8949911Z   DRIVER_VERSION: 570.133.07
2025-05-07T20:23:38.8950147Z ##[endgroup]
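Worth pulling out of the script above: nvidia-smi exit statuses 0 and 14 are both treated as healthy (per the gpu-operator issue linked in the script), and anything else fails the step. A standalone sketch of that gate, assuming only that nvidia-smi is on PATH:

    set +e
    nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
    NVIDIA_SMI_STATUS=$?
    set -e
    # 0 = success; 14 is an allowed status per
    # https://github.com/NVIDIA/gpu-operator/issues/285
    if [ "${NVIDIA_SMI_STATUS}" -eq 0 ] || [ "${NVIDIA_SMI_STATUS}" -eq 14 ]; then
      echo "INFO: Ignoring allowed status ${NVIDIA_SMI_STATUS}"
    else
      echo "ERROR: nvidia-smi exited with unresolved status ${NVIDIA_SMI_STATUS}" >&2
      exit "${NVIDIA_SMI_STATUS}"
    fi

Querying gpu_name (rather than trusting nvidia-smi's overall exit code) is what actually catches a crashed driver, since a crashed driver still reports its version but returns ERR! for device details.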
2025-05-07T20:23:38.9765844Z == Installing nvidia driver NVIDIA-Linux-x86_64-570.133.07.run ==
2025-05-07T20:23:38.9766864Z + pre_install_nvidia_driver_amzn2
2025-05-07T20:23:38.9770763Z + sudo yum remove -y nvidia-driver-latest-dkms
2025-05-07T20:23:39.3367889Z No match for argument: nvidia-driver-latest-dkms
2025-05-07T20:23:39.3368243Z No packages marked for removal.
2025-05-07T20:23:39.3431611Z Dependencies resolved.
2025-05-07T20:23:39.3443249Z Nothing to do.
2025-05-07T20:23:39.3443726Z Complete!
2025-05-07T20:23:39.3781147Z + install_nvidia_driver_common
2025-05-07T20:23:39.3785471Z + echo 'Before installing NVIDIA driver'
2025-05-07T20:23:39.3786616Z Before installing NVIDIA driver
2025-05-07T20:23:39.3788979Z + lspci
2025-05-07T20:23:39.3979544Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:23:39.3980622Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:23:39.3981188Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:23:39.3981689Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:23:39.3982155Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:23:39.3982665Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:23:39.3983125Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:23:39.3983586Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:23:39.3983977Z + lsmod
2025-05-07T20:23:39.4024359Z Module                  Size  Used by
2025-05-07T20:23:39.4025169Z xt_conntrack           16384  1
2025-05-07T20:23:39.4025974Z nft_chain_nat          16384  3
2025-05-07T20:23:39.4026631Z xt_MASQUERADE          20480  1
2025-05-07T20:23:39.4027032Z nf_nat                 57344  2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:23:39.4027442Z nf_conntrack_netlink   57344  0
2025-05-07T20:23:39.4027918Z nf_conntrack          184320  4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:23:39.4028344Z nf_defrag_ipv6         24576  1 nf_conntrack
2025-05-07T20:23:39.4028648Z nf_defrag_ipv4         16384  1 nf_conntrack
2025-05-07T20:23:39.4028932Z xfrm_user              57344  1
2025-05-07T20:23:39.4029192Z xfrm_algo              16384  1 xfrm_user
2025-05-07T20:23:39.4029472Z xt_addrtype            16384  2
2025-05-07T20:23:39.4029725Z nft_compat             20480  4
2025-05-07T20:23:39.4030011Z nf_tables             311296  57 nft_compat,nft_chain_nat
2025-05-07T20:23:39.4030409Z nfnetlink              20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:23:39.4030765Z br_netfilter           36864  0
2025-05-07T20:23:39.4031243Z bridge                323584  1 br_netfilter
2025-05-07T20:23:39.4031532Z stp                    16384  1 bridge
2025-05-07T20:23:39.4031810Z llc                    16384  2 bridge,stp
2025-05-07T20:23:39.4032087Z overlay               167936  0
2025-05-07T20:23:39.4032323Z tls                   135168  0
2025-05-07T20:23:39.4032561Z nls_ascii              16384  1
2025-05-07T20:23:39.4032806Z nls_cp437              20480  1
2025-05-07T20:23:39.4033044Z vfat                   24576  1
2025-05-07T20:23:39.4033288Z fat                    86016  1 vfat
2025-05-07T20:23:39.4033543Z ena                   180224  0
2025-05-07T20:23:39.4033773Z sunrpc                696320  1
2025-05-07T20:23:39.4034014Z i8042                  45056  0
2025-05-07T20:23:39.4034260Z serio                  28672  3 i8042
2025-05-07T20:23:39.4034525Z ghash_clmulni_intel    16384  0
2025-05-07T20:23:39.4034778Z button                 24576  0
2025-05-07T20:23:39.4035025Z sch_fq_codel           20480  17
2025-05-07T20:23:39.4035276Z dm_mod                188416  0
2025-05-07T20:23:39.4035518Z dax                    45056  1 dm_mod
2025-05-07T20:23:39.4035787Z loop                   36864  0
2025-05-07T20:23:39.4036020Z fuse                  163840  1
2025-05-07T20:23:39.4036266Z configfs               57344  1
2025-05-07T20:23:39.4036506Z dmi_sysfs              20480  0
2025-05-07T20:23:39.4036749Z crc32_pclmul           16384  0
2025-05-07T20:23:39.4036993Z crc32c_intel           24576  0
2025-05-07T20:23:39.4037284Z efivarfs               24576  1
2025-05-07T20:23:39.4037527Z + modinfo nvidia
2025-05-07T20:23:39.4044110Z filename:       /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:23:39.4044769Z import_ns:      DMA_BUF
2025-05-07T20:23:39.4045095Z alias:          char-major-195-*
2025-05-07T20:23:39.4045437Z version:        570.133.07
2025-05-07T20:23:39.4045675Z supported:      external
2025-05-07T20:23:39.4045917Z license:        Dual MIT/GPL
2025-05-07T20:23:39.4046205Z firmware:       nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:23:39.4046555Z firmware:       nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:23:39.4047065Z srcversion:     49515739FD8F721A3F2F714
2025-05-07T20:23:39.4047384Z alias:          pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:23:39.4047806Z alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:23:39.4048150Z alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:23:39.4048451Z depends:        i2c-core,drm
2025-05-07T20:23:39.4048701Z retpoline:      Y
2025-05-07T20:23:39.4048905Z name:           nvidia
2025-05-07T20:23:39.4049253Z vermagic:       6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:23:39.4049706Z parm:           NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:23:39.4050141Z parm:           NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:23:39.4050536Z parm:           NVreg_ResmanDebugLevel:int
2025-05-07T20:23:39.4050832Z parm:           NVreg_RmLogonRC:int
2025-05-07T20:23:39.4051127Z parm:           NVreg_ModifyDeviceFiles:int
2025-05-07T20:23:39.4051430Z parm:           NVreg_DeviceFileUID:int
2025-05-07T20:23:39.4051725Z parm:           NVreg_DeviceFileGID:int
2025-05-07T20:23:39.4052018Z parm:           NVreg_DeviceFileMode:int
2025-05-07T20:23:39.4052365Z parm:           NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:23:39.4052757Z parm:           NVreg_UsePageAttributeTable:int
2025-05-07T20:23:39.4053079Z parm:           NVreg_EnablePCIeGen3:int
2025-05-07T20:23:39.4053366Z parm:           NVreg_EnableMSI:int
2025-05-07T20:23:39.4053653Z parm:           NVreg_EnableStreamMemOPs:int
2025-05-07T20:23:39.4054001Z parm:           NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:23:39.4054385Z parm:           NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:23:39.4054744Z parm:           NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:23:39.4055144Z parm:           NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:39.4055681Z parm:           NVreg_DynamicPowerManagement:int
2025-05-07T20:23:39.4056085Z parm:           NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:39.4056505Z parm:           NVreg_EnableGpuFirmware:int
2025-05-07T20:23:39.4056916Z parm:           NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:23:39.4057276Z parm:           NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:23:39.4057628Z parm:           NVreg_EnableUserNUMAManagement:int
2025-05-07T20:23:39.4057959Z parm:           NVreg_MemoryPoolSize:int
2025-05-07T20:23:39.4058269Z parm:           NVreg_KMallocHeapMaxSize:int
2025-05-07T20:23:39.4058581Z parm:           NVreg_VMallocHeapMaxSize:int
2025-05-07T20:23:39.4058936Z parm:           NVreg_IgnoreMMIOCheck:int
2025-05-07T20:23:39.4059275Z parm:           NVreg_NvLinkDisable:int
2025-05-07T20:23:39.4059608Z parm:           NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:23:39.4060012Z parm:           NVreg_RegisterPCIDriver:int
2025-05-07T20:23:39.4060387Z parm:           NVreg_EnableResizableBar:int
2025-05-07T20:23:39.4060712Z parm:           NVreg_EnableDbgBreakpoint:int
2025-05-07T20:23:39.4061044Z parm:           NVreg_EnableNonblockingOpen:int
2025-05-07T20:23:39.4061372Z parm:           NVreg_RegistryDwords:charp
2025-05-07T20:23:39.4061708Z parm:           NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:23:39.4062031Z parm:           NVreg_RmMsg:charp
2025-05-07T20:23:39.4062305Z parm:           NVreg_GpuBlacklist:charp
2025-05-07T20:23:39.4062695Z parm:           NVreg_TemporaryFilePath:charp
2025-05-07T20:23:39.4063045Z parm:           NVreg_ExcludedGpus:charp
2025-05-07T20:23:39.4063347Z parm:           NVreg_DmaRemapPeerMmio:int
2025-05-07T20:23:39.4063671Z parm:           NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:23:39.4064011Z parm:           NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:23:39.4064346Z parm:           NVreg_ImexChannelCount:int
2025-05-07T20:23:39.4064666Z parm:           NVreg_CreateImexChannel0:int
2025-05-07T20:23:39.4065009Z parm:           NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:23:39.4065341Z parm:           rm_firmware_active:charp
2025-05-07T20:23:39.4065736Z + HAS_NVIDIA_DRIVER=0
2025-05-07T20:23:39.4065975Z ++ command -v nvidia-smi
2025-05-07T20:23:39.4066228Z + '[' -x /usr/bin/nvidia-smi ']'
2025-05-07T20:23:39.4066472Z + set +e
2025-05-07T20:23:39.4066794Z ++ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0
2025-05-07T20:23:41.2269963Z + INSTALLED_DRIVER_VERSION=570.133.07
2025-05-07T20:23:41.2270307Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:41.2270610Z + '[' 0 -ne 0 ']'
2025-05-07T20:23:41.2270816Z + '[' 570.133.07 '!=' 570.133.07 ']'
2025-05-07T20:23:41.2271089Z + HAS_NVIDIA_DRIVER=1
2025-05-07T20:23:41.2271509Z + echo 'NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation'
2025-05-07T20:23:41.2271954Z + set -e
2025-05-07T20:23:41.2272141Z + '[' 1 -eq 0 ']'
2025-05-07T20:23:41.2272516Z NVIDIA driver (570.133.07) has already been installed. Skipping NVIDIA driver installation
2025-05-07T20:23:41.2272997Z + post_install_nvidia_driver_common
2025-05-07T20:23:41.2276109Z + sudo modprobe nvidia
2025-05-07T20:23:41.3138084Z + echo 'After installing NVIDIA driver'
2025-05-07T20:23:41.3138402Z + lspci
2025-05-07T20:23:41.3138606Z After installing NVIDIA driver
2025-05-07T20:23:41.3257047Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
2025-05-07T20:23:41.3257525Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
2025-05-07T20:23:41.3258076Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
2025-05-07T20:23:41.3258580Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
2025-05-07T20:23:41.3259039Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller
2025-05-07T20:23:41.3259547Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
2025-05-07T20:23:41.3260021Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)
2025-05-07T20:23:41.3260764Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
2025-05-07T20:23:41.3261154Z + lsmod
2025-05-07T20:23:41.3288525Z Module                  Size  Used by
2025-05-07T20:23:41.3288824Z nvidia_uvm           1884160  0
2025-05-07T20:23:41.3289108Z nvidia              11583488  1 nvidia_uvm
2025-05-07T20:23:41.3289383Z drm                   602112  1 nvidia
2025-05-07T20:23:41.3289675Z drm_panel_orientation_quirks    32768  1 drm
2025-05-07T20:23:41.3289973Z backlight              24576  1 drm
2025-05-07T20:23:41.3290250Z i2c_core              110592  2 nvidia,drm
2025-05-07T20:23:41.3290529Z xt_conntrack           16384  1
2025-05-07T20:23:41.3290786Z nft_chain_nat          16384  3
2025-05-07T20:23:41.3291036Z xt_MASQUERADE          20480  1
2025-05-07T20:23:41.3291317Z nf_nat                 57344  2 nft_chain_nat,xt_MASQUERADE
2025-05-07T20:23:41.3291636Z nf_conntrack_netlink   57344  0
2025-05-07T20:23:41.3292015Z nf_conntrack          184320  4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE
2025-05-07T20:23:41.3292431Z nf_defrag_ipv6         24576  1 nf_conntrack
2025-05-07T20:23:41.3292737Z nf_defrag_ipv4         16384  1 nf_conntrack
2025-05-07T20:23:41.3293022Z xfrm_user              57344  1
2025-05-07T20:23:41.3293273Z xfrm_algo              16384  1 xfrm_user
2025-05-07T20:23:41.3293552Z xt_addrtype            16384  2
2025-05-07T20:23:41.3293794Z nft_compat             20480  4
2025-05-07T20:23:41.3294083Z nf_tables             311296  57 nft_compat,nft_chain_nat
2025-05-07T20:23:41.3294470Z nfnetlink              20480  4 nft_compat,nf_conntrack_netlink,nf_tables
2025-05-07T20:23:41.3294829Z br_netfilter           36864  0
2025-05-07T20:23:41.3295098Z bridge                323584  1 br_netfilter
2025-05-07T20:23:41.3295373Z stp                    16384  1 bridge
2025-05-07T20:23:41.3295647Z llc                    16384  2 bridge,stp
2025-05-07T20:23:41.3295915Z overlay               167936  0
2025-05-07T20:23:41.3296150Z tls                   135168  0
2025-05-07T20:23:41.3296397Z nls_ascii              16384  1
2025-05-07T20:23:41.3296809Z nls_cp437              20480  1
2025-05-07T20:23:41.3297044Z vfat                   24576  1
2025-05-07T20:23:41.3297284Z fat                    86016  1 vfat
2025-05-07T20:23:41.3297534Z ena                   180224  0
2025-05-07T20:23:41.3297774Z sunrpc                696320  1
2025-05-07T20:23:41.3298006Z i8042                  45056  0
2025-05-07T20:23:41.3298251Z serio                  28672  3 i8042
2025-05-07T20:23:41.3298519Z ghash_clmulni_intel    16384  0
2025-05-07T20:23:41.3298760Z button                 24576  0
2025-05-07T20:23:41.3299011Z sch_fq_codel           20480  17
2025-05-07T20:23:41.3299259Z dm_mod                188416  0
2025-05-07T20:23:41.3299493Z dax                    45056  1 dm_mod
2025-05-07T20:23:41.3299755Z loop                   36864  0
2025-05-07T20:23:41.3299993Z fuse                  163840  1
2025-05-07T20:23:41.3300226Z configfs               57344  1
2025-05-07T20:23:41.3300475Z dmi_sysfs              20480  0
2025-05-07T20:23:41.3300720Z crc32_pclmul           16384  0
2025-05-07T20:23:41.3300961Z crc32c_intel           24576  0
2025-05-07T20:23:41.3301204Z efivarfs               24576  1
2025-05-07T20:23:41.3301447Z + modinfo nvidia
2025-05-07T20:23:41.3305829Z filename:       /lib/modules/6.1.130-139.222.amzn2023.x86_64/kernel/drivers/video/nvidia.ko
2025-05-07T20:23:41.3306279Z import_ns:      DMA_BUF
2025-05-07T20:23:41.3306526Z alias:          char-major-195-*
2025-05-07T20:23:41.3306794Z version:        570.133.07
2025-05-07T20:23:41.3307030Z supported:      external
2025-05-07T20:23:41.3307274Z license:        Dual MIT/GPL
2025-05-07T20:23:41.3307644Z firmware:       nvidia/570.133.07/gsp_tu10x.bin
2025-05-07T20:23:41.3307999Z firmware:       nvidia/570.133.07/gsp_ga10x.bin
2025-05-07T20:23:41.3308301Z srcversion:     49515739FD8F721A3F2F714
2025-05-07T20:23:41.3308613Z alias:          pci:v000010DEd*sv*sd*bc06sc80i00*
2025-05-07T20:23:41.3308937Z alias:          pci:v000010DEd*sv*sd*bc03sc02i00*
2025-05-07T20:23:41.3309393Z alias:          pci:v000010DEd*sv*sd*bc03sc00i00*
2025-05-07T20:23:41.3309698Z depends:        i2c-core,drm
2025-05-07T20:23:41.3309945Z retpoline:      Y
2025-05-07T20:23:41.3310149Z name:           nvidia
2025-05-07T20:23:41.3310503Z vermagic:       6.1.130-139.222.amzn2023.x86_64 SMP preempt mod_unload modversions
2025-05-07T20:23:41.3310960Z parm:           NvSwitchRegDwords:NvSwitch regkey (charp)
2025-05-07T20:23:41.3311395Z parm:           NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp)
2025-05-07T20:23:41.3311805Z parm:           NVreg_ResmanDebugLevel:int
2025-05-07T20:23:41.3312100Z parm:           NVreg_RmLogonRC:int
2025-05-07T20:23:41.3312392Z parm:           NVreg_ModifyDeviceFiles:int
2025-05-07T20:23:41.3312689Z parm:           NVreg_DeviceFileUID:int
2025-05-07T20:23:41.3312991Z parm:           NVreg_DeviceFileGID:int
2025-05-07T20:23:41.3313319Z parm:           NVreg_DeviceFileMode:int
2025-05-07T20:23:41.3313806Z parm:           NVreg_InitializeSystemMemoryAllocations:int
2025-05-07T20:23:41.3314209Z parm:           NVreg_UsePageAttributeTable:int
2025-05-07T20:23:41.3314531Z parm:           NVreg_EnablePCIeGen3:int
2025-05-07T20:23:41.3314816Z parm:           NVreg_EnableMSI:int
2025-05-07T20:23:41.3315107Z parm:           NVreg_EnableStreamMemOPs:int
2025-05-07T20:23:41.3315453Z parm:           NVreg_RestrictProfilingToAdminUsers:int
2025-05-07T20:23:41.3315883Z parm:           NVreg_PreserveVideoMemoryAllocations:int
2025-05-07T20:23:41.3316411Z parm:           NVreg_EnableS0ixPowerManagement:int
2025-05-07T20:23:41.3316902Z parm:           NVreg_S0ixPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:41.3317293Z parm:           NVreg_DynamicPowerManagement:int
2025-05-07T20:23:41.3317690Z parm:           NVreg_DynamicPowerManagementVideoMemoryThreshold:int
2025-05-07T20:23:41.3318086Z parm:           NVreg_EnableGpuFirmware:int
2025-05-07T20:23:41.3318412Z parm:           NVreg_EnableGpuFirmwareLogs:int
2025-05-07T20:23:41.3318765Z parm:           NVreg_OpenRmEnableUnsupportedGpus:int
2025-05-07T20:23:41.3319231Z parm:           NVreg_EnableUserNUMAManagement:int
2025-05-07T20:23:41.3319565Z parm:           NVreg_MemoryPoolSize:int
2025-05-07T20:23:41.3319876Z parm:           NVreg_KMallocHeapMaxSize:int
2025-05-07T20:23:41.3320190Z parm:           NVreg_VMallocHeapMaxSize:int
2025-05-07T20:23:41.3320501Z parm:           NVreg_IgnoreMMIOCheck:int
2025-05-07T20:23:41.3320800Z parm:           NVreg_NvLinkDisable:int
2025-05-07T20:23:41.3321131Z parm:           NVreg_EnablePCIERelaxedOrderingMode:int
2025-05-07T20:23:41.3321479Z parm:           NVreg_RegisterPCIDriver:int
2025-05-07T20:23:41.3321800Z parm:           NVreg_EnableResizableBar:int
2025-05-07T20:23:41.3322117Z parm:           NVreg_EnableDbgBreakpoint:int
2025-05-07T20:23:41.3322454Z parm:           NVreg_EnableNonblockingOpen:int
2025-05-07T20:23:41.3322778Z parm:           NVreg_RegistryDwords:charp
2025-05-07T20:23:41.3323116Z parm:           NVreg_RegistryDwordsPerDevice:charp
2025-05-07T20:23:41.3323437Z parm:           NVreg_RmMsg:charp
2025-05-07T20:23:41.3323721Z parm:           NVreg_GpuBlacklist:charp
2025-05-07T20:23:41.3324039Z parm:           NVreg_TemporaryFilePath:charp
2025-05-07T20:23:41.3324344Z parm:           NVreg_ExcludedGpus:charp
2025-05-07T20:23:41.3324645Z parm:           NVreg_DmaRemapPeerMmio:int
2025-05-07T20:23:41.3324964Z parm:           NVreg_RmNvlinkBandwidth:charp
2025-05-07T20:23:41.3325299Z parm:           NVreg_RmNvlinkBandwidthLinkCount:int
2025-05-07T20:23:41.3325692Z parm:           NVreg_ImexChannelCount:int
2025-05-07T20:23:41.3326004Z parm:           NVreg_CreateImexChannel0:int
2025-05-07T20:23:41.3326337Z parm:           NVreg_GrdmaPciTopoCheckOverride:int
2025-05-07T20:23:41.3326667Z parm:           rm_firmware_active:charp
2025-05-07T20:23:41.3326931Z + set +e
2025-05-07T20:23:41.3327123Z + nvidia-smi
2025-05-07T20:23:42.7335099Z Wed May  7 20:23:42 2025
2025-05-07T20:23:42.7335498Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:42.7336363Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:42.7336850Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:42.7337332Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:42.7337846Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:42.7338255Z |                                         |                        |               MIG M. |
2025-05-07T20:23:42.7338584Z |=========================================+========================+======================|
2025-05-07T20:23:42.7399026Z |   0  NVIDIA A10G                    Off |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:42.7399479Z |  0%   28C    P0             64W /  300W |       0MiB /  23028MiB |      4%      Default |
2025-05-07T20:23:42.7399869Z |                                         |                        |                  N/A |
2025-05-07T20:23:42.7400248Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:42.7400636Z
2025-05-07T20:23:42.7401026Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:42.7401442Z | Processes:                                                                              |
2025-05-07T20:23:42.7401882Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:42.7402282Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:42.7402617Z |=========================================================================================|
2025-05-07T20:23:42.7403827Z |  No running processes found                                                             |
2025-05-07T20:23:42.7404480Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:43.1486845Z + nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0
2025-05-07T20:23:44.5566654Z NVIDIA A10G
2025-05-07T20:23:44.8318498Z + NVIDIA_SMI_STATUS=0
2025-05-07T20:23:44.8318858Z + '[' 0 -eq 0 ']'
2025-05-07T20:23:44.8319207Z + echo 'INFO: Ignoring allowed status 0'
2025-05-07T20:23:44.8319614Z + set -e
2025-05-07T20:23:44.8319809Z INFO: Ignoring allowed status 0
2025-05-07T20:23:44.8328583Z == Installing nvidia container toolkit for amzn2023 ==
2025-05-07T20:23:44.8331664Z + sudo yum install -y yum-utils
2025-05-07T20:23:45.2423230Z Last metadata expiration check: 0:06:17 ago on Wed May  7 20:17:28 2025.
2025-05-07T20:23:45.2677959Z Package dnf-utils-4.3.0-13.amzn2023.0.5.noarch is already installed.
2025-05-07T20:23:45.3068069Z Dependencies resolved.
2025-05-07T20:23:45.3247619Z Nothing to do.
2025-05-07T20:23:45.3247939Z Complete!
2025-05-07T20:23:45.3638848Z + [[ amzn2023 == \a\m\z\n\2\0\2\3 ]]
2025-05-07T20:23:45.3639562Z + YUM_REPO_URL=https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:45.3640649Z + sudo yum-config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:45.7146343Z Adding repo from: https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo
2025-05-07T20:23:45.7693936Z + sudo yum install -y nvidia-docker2 nvidia-container-toolkit-1.16.2
2025-05-07T20:23:46.3677970Z nvidia-container-toolkit                         12 kB/s | 833  B     00:00
2025-05-07T20:23:46.3924030Z Package nvidia-docker2-2.14.0-1.noarch is already installed.
2025-05-07T20:23:46.4321888Z Dependencies resolved.
2025-05-07T20:23:46.4499610Z ================================================================================
2025-05-07T20:23:46.4500383Z  Package                        Arch    Version   Repository                Size
2025-05-07T20:23:46.4500767Z ================================================================================
2025-05-07T20:23:46.4511542Z Downgrading:
2025-05-07T20:23:46.4511954Z  nvidia-container-toolkit       x86_64  1.16.2-1  nvidia-container-toolkit 1.2 M
2025-05-07T20:23:46.4512539Z  nvidia-container-toolkit-base  x86_64  1.16.2-1  nvidia-container-toolkit 5.6 M
2025-05-07T20:23:46.4512885Z
2025-05-07T20:23:46.4512975Z Transaction Summary
2025-05-07T20:23:46.4513223Z ================================================================================
2025-05-07T20:23:46.4513535Z Downgrade  2 Packages
2025-05-07T20:23:46.4513680Z
2025-05-07T20:23:46.4513787Z Total download size: 6.8 M
2025-05-07T20:23:46.4514045Z Downloading Packages:
2025-05-07T20:23:46.4932432Z (1/2): nvidia-container-toolkit-1.16.2-1.x86_64  30 MB/s | 1.2 MB     00:00
2025-05-07T20:23:46.5445228Z (2/2): nvidia-container-toolkit-base-1.16.2-1.x  60 MB/s | 5.6 MB     00:00
2025-05-07T20:23:46.5453640Z --------------------------------------------------------------------------------
2025-05-07T20:23:46.5456877Z Total                                            72 MB/s | 6.8 MB     00:00
2025-05-07T20:23:46.5459059Z Running transaction check
2025-05-07T20:23:46.5560014Z Transaction check succeeded.
2025-05-07T20:23:46.5560618Z Running transaction test
2025-05-07T20:23:46.5852278Z Transaction test succeeded.
2025-05-07T20:23:46.5854809Z Running transaction
2025-05-07T20:23:47.1315758Z   Preparing        :                                                        1/1
2025-05-07T20:23:47.2367717Z   Downgrading      : nvidia-container-toolkit-base-1.16.2-1.x86_64          1/4
2025-05-07T20:23:47.2395057Z   Downgrading      : nvidia-container-toolkit-1.16.2-1.x86_64               2/4
2025-05-07T20:23:47.2601773Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64               2/4
2025-05-07T20:23:47.2602340Z   Cleanup          : nvidia-container-toolkit-1.17.6-1.x86_64               3/4
2025-05-07T20:23:47.2706058Z   Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64               3/4
2025-05-07T20:23:47.2732740Z   Cleanup          : nvidia-container-toolkit-base-1.17.6-1.x86_64          4/4
2025-05-07T20:23:47.4514586Z   Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64               4/4
2025-05-07T20:23:47.4515164Z   Verifying        : nvidia-container-toolkit-1.16.2-1.x86_64               1/4
2025-05-07T20:23:47.4515689Z   Verifying        : nvidia-container-toolkit-1.17.6-1.x86_64               2/4
2025-05-07T20:23:47.4516203Z   Verifying        : nvidia-container-toolkit-base-1.16.2-1.x86_64          3/4
2025-05-07T20:23:47.5910183Z   Verifying        : nvidia-container-toolkit-base-1.17.6-1.x86_64          4/4
================================================================================
2025-05-07T20:23:47.5911204Z WARNING:
2025-05-07T20:23:47.5911556Z   A newer release of "Amazon Linux" is available.
2025-05-07T20:23:47.5911965Z
2025-05-07T20:23:47.5912087Z   Available Versions:
2025-05-07T20:23:47.5912292Z
2025-05-07T20:23:47.5912428Z   Version 2023.7.20250331:
2025-05-07T20:23:47.5912747Z     Run the following command to upgrade to 2023.7.20250331:
2025-05-07T20:23:47.5912996Z
2025-05-07T20:23:47.5913116Z       dnf upgrade --releasever=2023.7.20250331
2025-05-07T20:23:47.5913321Z
2025-05-07T20:23:47.5913412Z     Release notes:
2025-05-07T20:23:47.5913812Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html
2025-05-07T20:23:47.5914193Z
2025-05-07T20:23:47.5914280Z   Version 2023.7.20250414:
2025-05-07T20:23:47.5914575Z     Run the following command to upgrade to 2023.7.20250414:
2025-05-07T20:23:47.5914821Z
2025-05-07T20:23:47.5914937Z       dnf upgrade --releasever=2023.7.20250414
2025-05-07T20:23:47.5915136Z
2025-05-07T20:23:47.5915218Z     Release notes:
2025-05-07T20:23:47.5915603Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html
2025-05-07T20:23:47.5916272Z
2025-05-07T20:23:47.5916360Z   Version 2023.7.20250428:
2025-05-07T20:23:47.5916658Z     Run the following command to upgrade to 2023.7.20250428:
2025-05-07T20:23:47.5916896Z
2025-05-07T20:23:47.5917008Z       dnf upgrade --releasever=2023.7.20250428
2025-05-07T20:23:47.5917219Z
2025-05-07T20:23:47.5917301Z     Release notes:
2025-05-07T20:23:47.5917677Z       https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html
2025-05-07T20:23:47.5918043Z
2025-05-07T20:23:47.5918159Z ================================================================================
2025-05-07T20:23:47.6262415Z
2025-05-07T20:23:47.6262615Z
2025-05-07T20:23:47.6262705Z Downgraded:
2025-05-07T20:23:47.6263057Z   nvidia-container-toolkit-1.16.2-1.x86_64
2025-05-07T20:23:47.6263624Z   nvidia-container-toolkit-base-1.16.2-1.x86_64
2025-05-07T20:23:47.6263966Z
2025-05-07T20:23:47.6264045Z Complete!
2025-05-07T20:23:47.6718490Z + sudo systemctl restart docker
2025-05-07T20:23:51.1286891Z Wed May  7 20:23:51 2025
2025-05-07T20:23:51.1287366Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:51.1287862Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:23:51.1288343Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:51.1288822Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:23:51.1289346Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:23:51.1289769Z |                                         |                        |               MIG M. |
2025-05-07T20:23:51.1290096Z |=========================================+========================+======================|
2025-05-07T20:23:51.1373422Z |   0  NVIDIA A10G                     On |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:23:51.1374225Z |  0%   28C    P0             64W /  300W |       0MiB /  23028MiB |      4%      Default |
2025-05-07T20:23:51.1374616Z |                                         |                        |                  N/A |
2025-05-07T20:23:51.1374998Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:23:51.1375380Z
2025-05-07T20:23:51.1375874Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:51.1376293Z | Processes:                                                                              |
2025-05-07T20:23:51.1376729Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:23:51.1377128Z |        ID   ID                                                               Usage      |
2025-05-07T20:23:51.1377462Z |=========================================================================================|
2025-05-07T20:23:51.1379033Z |  No running processes found                                                             |
2025-05-07T20:23:51.1379504Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:23:51.9517330Z Command completed after 1 attempt(s).
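The durable output of the step above is the GPU_FLAG line the script appended to "${GITHUB_ENV}"; subsequent steps in this job see it as an ordinary environment variable (it appears in the env: block below). A sketch of how a later containerized step might consume it; the CUDA image tag here is an assumption, not taken from this log:

    # GPU_FLAG='--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all' at this point.
    # Left unquoted on purpose so it word-splits into separate docker arguments.
    docker run --rm ${GPU_FLAG} nvidia/cuda:12.8.0-base-ubuntu22.04 nvidia-smi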
2025-05-07T20:23:52.2997379Z + printenv 2025-05-07T20:23:52.2997493Z 2025-05-07T20:23:52.3019380Z SHELL=/bin/bash 2025-05-07T20:23:52.3019912Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:23:52.3020794Z BUILD_VARIANT=cuda 2025-05-07T20:23:52.3022218Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_9a7d7a3a-ddc7-4928-87b7-ac501d01f089 2025-05-07T20:23:52.3023734Z GITHUB_ACTION=__run 2025-05-07T20:23:52.3024310Z GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:52.3024972Z GITHUB_RUN_NUMBER=10601 2025-05-07T20:23:52.3025441Z RUNNER_NAME=i-061cb0426579ace80 2025-05-07T20:23:52.3025961Z GITHUB_REPOSITORY_OWNER_ID=21003710 2025-05-07T20:23:52.3026532Z PLATFORM_NAME_LC=linux-x86_64 2025-05-07T20:23:52.3027032Z MACHINE_NAME_LC=x86_64 2025-05-07T20:23:52.3027908Z ACTIONS_RUNNER_HOOK_JOB_COMPLETED=/home/ec2-user/runner-scripts/after_job.sh 2025-05-07T20:23:52.3028713Z GITHUB_TRIGGERING_ACTOR=q10 2025-05-07T20:23:52.3029236Z PRELUDE=.github/scripts/setup_env.bash 2025-05-07T20:23:52.3029788Z GITHUB_REF_TYPE=branch 2025-05-07T20:23:52.3030228Z *** 2025-05-07T20:23:52.3030403Z LOGNAME=ec2-user 2025-05-07T20:23:52.3030628Z GITHUB_REPOSITORY_ID=150154628 2025-05-07T20:23:52.3030883Z ENFORCE_CUDA_DEVICE=1 2025-05-07T20:23:52.3031107Z GITHUB_ACTIONS=true 2025-05-07T20:23:52.3031320Z SYSTEMD_EXEC_PID=55553 2025-05-07T20:23:52.3031586Z GITHUB_SHA=a2f4c52051596e74bc8c16e3d2867a4ecdd271e0 2025-05-07T20:23:52.3032107Z GITHUB_WORKFLOW_REF=pytorch/FBGEMM/.github/workflows/fbgemm_gpu_ci_cuda.yml@refs/pull/4066/merge 2025-05-07T20:23:52.3032607Z RUNNER_ENVIRONMENT=self-hosted 2025-05-07T20:23:52.3032872Z GITHUB_REF=refs/pull/4066/merge 2025-05-07T20:23:52.3033117Z RUNNER_OS=Linux 2025-05-07T20:23:52.3033322Z GITHUB_REF_PROTECTED=false 2025-05-07T20:23:52.3033556Z HOME=/home/ec2-user 2025-05-07T20:23:52.3033802Z GITHUB_API_URL=https://api.github.com 2025-05-07T20:23:52.3034081Z LANG=C.UTF-8 2025-05-07T20:23:52.3034371Z RUNNER_TRACKING_ID=github_442cc8eb-c0ed-4196-93e2-da6db7d8b0a7 2025-05-07T20:23:52.3034720Z RUNNER_ARCH=X64 2025-05-07T20:23:52.3034977Z RUNNER_TEMP=/home/ec2-user/actions-runner/_work/_temp 2025-05-07T20:23:52.3035293Z BUILD_TARGET=genai 2025-05-07T20:23:52.3035803Z GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_9a7d7a3a-ddc7-4928-87b7-ac501d01f089 2025-05-07T20:23:52.3036647Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_9a7d7a3a-ddc7-4928-87b7-ac501d01f089 2025-05-07T20:23:52.3037359Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json 2025-05-07T20:23:52.3038245Z INVOCATION_ID=a688399bf0f247da91f959ddef8510d2 2025-05-07T20:23:52.3038697Z GITHUB_EVENT_NAME=pull_request 2025-05-07T20:23:52.3039034Z GITHUB_RUN_ID=14891846252 2025-05-07T20:23:52.3039748Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_9a7d7a3a-ddc7-4928-87b7-ac501d01f089 2025-05-07T20:23:52.3040612Z BUILD_ENV=build_binary 2025-05-07T20:23:52.3040824Z GITHUB_ACTOR=q10 2025-05-07T20:23:52.3041084Z GITHUB_RUN_ATTEMPT=1 2025-05-07T20:23:52.3041390Z KERN_NAME_LC=linux 2025-05-07T20:23:52.3041672Z BUILD_CUDA_VERSION=12.8.0 2025-05-07T20:23:52.3042064Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql 2025-05-07T20:23:52.3042768Z PLATFORM_NAME=Linux-x86_64 2025-05-07T20:23:52.3043108Z USER=ec2-user 2025-05-07T20:23:52.3043369Z GITHUB_SERVER_URL=https://github.com 
2025-05-07T20:23:52.3043634Z SHLVL=1 2025-05-07T20:23:52.3043818Z GITHUB_ACTOR_ID=255046 2025-05-07T20:23:52.3044109Z RUNNER_TOOL_CACHE=/home/ec2-user/actions-runner/_work/_tool 2025-05-07T20:23:52.3044541Z GITHUB_WORKFLOW_SHA=6060cd4b5f971680caecdcc657faccb5720d1c3e 2025-05-07T20:23:52.3044894Z GITHUB_REF_NAME=4066/merge 2025-05-07T20:23:52.3045117Z KERN_NAME=Linux 2025-05-07T20:23:52.3045338Z GITHUB_JOB=test_and_publish_artifact 2025-05-07T20:23:52.3045732Z ACTIONS_RUNNER_HOOK_JOB_STARTED=/home/ec2-user/runner-scripts/before_job.sh 2025-05-07T20:23:52.3046139Z GITHUB_REPOSITORY=pytorch/FBGEMM 2025-05-07T20:23:52.3046401Z GITHUB_RETENTION_DAYS=90 2025-05-07T20:23:52.3046632Z JOURNAL_STREAM=8:96283 2025-05-07T20:23:52.3046927Z RUNNER_WORKSPACE=/home/ec2-user/actions-runner/_work/FBGEMM 2025-05-07T20:23:52.3047285Z GITHUB_ACTION_REPOSITORY= 2025-05-07T20:23:52.3047592Z PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin 2025-05-07T20:23:52.3047919Z GITHUB_BASE_REF=main 2025-05-07T20:23:52.3048121Z CI=true 2025-05-07T20:23:52.3048323Z GITHUB_REPOSITORY_OWNER=pytorch 2025-05-07T20:23:52.3048597Z GITHUB_HEAD_REF=bm/genai-rocm-oss-6 2025-05-07T20:23:52.3048852Z GITHUB_ACTION_REF= 2025-05-07T20:23:52.3049089Z GITHUB_WORKFLOW=FBGEMM GPU/GenAI CUDA CI 2025-05-07T20:23:52.3049708Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_9a7d7a3a-ddc7-4928-87b7-ac501d01f089 2025-05-07T20:23:52.3050293Z MACHINE_NAME=x86_64 2025-05-07T20:23:52.3050505Z _=/usr/bin/printenv 2025-05-07T20:23:52.3050632Z 2025-05-07T20:23:52.3050750Z ################################################################################ 2025-05-07T20:23:52.3051045Z [INFO] Print ldd version ... 2025-05-07T20:23:52.3051287Z + ldd --version 2025-05-07T20:23:52.3051415Z 2025-05-07T20:23:52.3051512Z ldd (GNU libc) 2.34 2025-05-07T20:23:52.3051779Z Copyright (C) 2021 Free Software Foundation, Inc. 2025-05-07T20:23:52.3052200Z This is free software; see the source for copying conditions. There is NO 2025-05-07T20:23:52.3052715Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 2025-05-07T20:23:52.3053148Z Written by Roland McGrath and Ulrich Drepper. 2025-05-07T20:23:52.3053357Z 2025-05-07T20:23:52.3053479Z ################################################################################ 2025-05-07T20:23:52.3053773Z [INFO] Print CPU info ... 
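[NOTE] nproc and lscpu below report 16 logical CPUs (8 cores, 2 threads per core). A typical use for this figure in later build steps is deriving a parallel-job count; a minimal sketch with illustrative variable names (the actual setup_env.bash logic is not shown in this log):

  # Derive a parallelism level from the visible CPU count, keeping one
  # CPU free for the rest of the system.
  core_count=$(nproc)
  build_jobs=$(( core_count > 1 ? core_count - 1 : 1 ))
  echo "[INFO] Using ${build_jobs} parallel jobs on ${core_count} logical CPUs"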
2025-05-07T20:23:52.3054004Z + nproc 2025-05-07T20:23:52.3054105Z 2025-05-07T20:23:52.3063959Z 16 2025-05-07T20:23:52.3065637Z 2025-05-07T20:23:52.3065936Z + lscpu 2025-05-07T20:23:52.3066090Z 2025-05-07T20:23:52.3182818Z Architecture: x86_64 2025-05-07T20:23:52.3183647Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:23:52.3184417Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:23:52.3185157Z Byte Order: Little Endian 2025-05-07T20:23:52.3185804Z CPU(s): 16 2025-05-07T20:23:52.3186361Z On-line CPU(s) list: 0-15 2025-05-07T20:23:52.3186967Z Vendor ID: AuthenticAMD 2025-05-07T20:23:52.3187753Z Model name: AMD EPYC 7R32 2025-05-07T20:23:52.3188355Z CPU family: 23 2025-05-07T20:23:52.3189255Z Model: 49 2025-05-07T20:23:52.3189801Z Thread(s) per core: 2 2025-05-07T20:23:52.3190357Z Core(s) per socket: 8 2025-05-07T20:23:52.3190639Z Socket(s): 1 2025-05-07T20:23:52.3190901Z Stepping: 0 2025-05-07T20:23:52.3191194Z BogoMIPS: 5598.98 2025-05-07T20:23:52.3193210Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:52.3195357Z Hypervisor vendor: KVM 2025-05-07T20:23:52.3195658Z Virtualization type: full 2025-05-07T20:23:52.3195986Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:23:52.3196335Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:23:52.3196686Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:23:52.3197035Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:23:52.3197341Z NUMA node(s): 1 2025-05-07T20:23:52.3197628Z NUMA node0 CPU(s): 0-15 2025-05-07T20:23:52.3197958Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:23:52.3198318Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:23:52.3198666Z Vulnerability L1tf: Not affected 2025-05-07T20:23:52.3199011Z Vulnerability Mds: Not affected 2025-05-07T20:23:52.3199361Z Vulnerability Meltdown: Not affected 2025-05-07T20:23:52.3199707Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:23:52.3200106Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:23:52.3200629Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:23:52.3201186Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:23:52.3201717Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:23:52.3202395Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:23:52.3203237Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:23:52.3203884Z Vulnerability Srbds: Not affected 2025-05-07T20:23:52.3204238Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:23:52.3204553Z 2025-05-07T20:23:52.3204642Z + cat /proc/cpuinfo 2025-05-07T20:23:52.3204772Z 2025-05-07T20:23:52.3204858Z processor : 0 2025-05-07T20:23:52.3205069Z vendor_id : AuthenticAMD 2025-05-07T20:23:52.3205306Z cpu family : 23 2025-05-07T20:23:52.3205506Z model : 49 
2025-05-07T20:23:52.3205704Z model name : AMD EPYC 7R32 2025-05-07T20:23:52.3205943Z stepping : 0 2025-05-07T20:23:52.3206144Z microcode : 0x830107f 2025-05-07T20:23:52.3206360Z cpu MHz : 3314.754 2025-05-07T20:23:52.3206568Z cache size : 512 KB 2025-05-07T20:23:52.3206777Z physical id : 0 2025-05-07T20:23:52.3206979Z siblings : 16 2025-05-07T20:23:52.3207178Z core id : 0 2025-05-07T20:23:52.3207372Z cpu cores : 8 2025-05-07T20:23:52.3207558Z apicid : 0 2025-05-07T20:23:52.3207754Z initial apicid : 0 2025-05-07T20:23:52.3207962Z fpu : yes 2025-05-07T20:23:52.3208153Z fpu_exception : yes 2025-05-07T20:23:52.3208366Z cpuid level : 13 2025-05-07T20:23:52.3208571Z wp : yes 2025-05-07T20:23:52.3210589Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:52.3212836Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:52.3213307Z bogomips : 5598.98 2025-05-07T20:23:52.3213525Z TLB size : 3072 4K pages 2025-05-07T20:23:52.3213760Z clflush size : 64 2025-05-07T20:23:52.3213967Z cache_alignment : 64 2025-05-07T20:23:52.3214235Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:52.3214558Z power management: 2025-05-07T20:23:52.3214687Z 2025-05-07T20:23:52.3214767Z processor : 1 2025-05-07T20:23:52.3214981Z vendor_id : AuthenticAMD 2025-05-07T20:23:52.3215218Z cpu family : 23 2025-05-07T20:23:52.3215417Z model : 49 2025-05-07T20:23:52.3215621Z model name : AMD EPYC 7R32 2025-05-07T20:23:52.3215861Z stepping : 0 2025-05-07T20:23:52.3216057Z microcode : 0x830107f 2025-05-07T20:23:52.3216280Z cpu MHz : 2095.023 2025-05-07T20:23:52.3216489Z cache size : 512 KB 2025-05-07T20:23:52.3216693Z physical id : 0 2025-05-07T20:23:52.3216902Z siblings : 16 2025-05-07T20:23:52.3217095Z core id : 1 2025-05-07T20:23:52.3217286Z cpu cores : 8 2025-05-07T20:23:52.3217477Z apicid : 2 2025-05-07T20:23:52.3217672Z initial apicid : 2 2025-05-07T20:23:52.3217877Z fpu : yes 2025-05-07T20:23:52.3218065Z fpu_exception : yes 2025-05-07T20:23:52.3218277Z cpuid level : 13 2025-05-07T20:23:52.3218477Z wp : yes 2025-05-07T20:23:52.3220424Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:52.3222604Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:52.3223085Z bogomips : 5598.98 2025-05-07T20:23:52.3223301Z TLB size : 3072 4K pages 2025-05-07T20:23:52.3223527Z clflush size : 64 
2025-05-07T20:23:52.3223742Z cache_alignment : 64 2025-05-07T20:23:52.3224005Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:52.3224309Z power management: 2025-05-07T20:23:52.3224442Z 2025-05-07T20:23:52.3224527Z processor : 2 2025-05-07T20:23:52.3224742Z vendor_id : AuthenticAMD 2025-05-07T20:23:52.3224977Z cpu family : 23 2025-05-07T20:23:52.3225170Z model : 49 2025-05-07T20:23:52.3225368Z model name : AMD EPYC 7R32 2025-05-07T20:23:52.3225599Z stepping : 0 2025-05-07T20:23:52.3225798Z microcode : 0x830107f 2025-05-07T20:23:52.3226019Z cpu MHz : 2018.362 2025-05-07T20:23:52.3226228Z cache size : 512 KB 2025-05-07T20:23:52.3226446Z physical id : 0 2025-05-07T20:23:52.3226651Z siblings : 16 2025-05-07T20:23:52.3226850Z core id : 2 2025-05-07T20:23:52.3227036Z cpu cores : 8 2025-05-07T20:23:52.3227233Z apicid : 4 2025-05-07T20:23:52.3227430Z initial apicid : 4 2025-05-07T20:23:52.3227732Z fpu : yes 2025-05-07T20:23:52.3275008Z fpu_exception : yes 2025-05-07T20:23:52.3275288Z cpuid level : 13 2025-05-07T20:23:52.3275523Z wp : yes 2025-05-07T20:23:52.3277766Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:52.3279962Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:52.3280428Z bogomips : 5598.98 2025-05-07T20:23:52.3280749Z TLB size : 3072 4K pages 2025-05-07T20:23:52.3280978Z clflush size : 64 2025-05-07T20:23:52.3281191Z cache_alignment : 64 2025-05-07T20:23:52.3281448Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:52.3281749Z power management: 2025-05-07T20:23:52.3281877Z 2025-05-07T20:23:52.3281960Z processor : 3 2025-05-07T20:23:52.3282158Z vendor_id : AuthenticAMD 2025-05-07T20:23:52.3282393Z cpu family : 23 2025-05-07T20:23:52.3282597Z model : 49 2025-05-07T20:23:52.3282793Z model name : AMD EPYC 7R32 2025-05-07T20:23:52.3283015Z stepping : 0 2025-05-07T20:23:52.3283210Z microcode : 0x830107f 2025-05-07T20:23:52.3283427Z cpu MHz : 3299.002 2025-05-07T20:23:52.3283623Z cache size : 512 KB 2025-05-07T20:23:52.3283828Z physical id : 0 2025-05-07T20:23:52.3284028Z siblings : 16 2025-05-07T20:23:52.3284214Z core id : 3 2025-05-07T20:23:52.3284402Z cpu cores : 8 2025-05-07T20:23:52.3284588Z apicid : 6 2025-05-07T20:23:52.3284769Z initial apicid : 6 2025-05-07T20:23:52.3284980Z fpu : yes 2025-05-07T20:23:52.3285168Z fpu_exception : yes 2025-05-07T20:23:52.3285370Z cpuid level : 13 2025-05-07T20:23:52.3285562Z wp : yes 2025-05-07T20:23:52.3287596Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb 
sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:52.3289744Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:52.3290254Z bogomips : 5598.98 2025-05-07T20:23:52.3290463Z TLB size : 3072 4K pages 2025-05-07T20:23:52.3290686Z clflush size : 64 2025-05-07T20:23:52.3290883Z cache_alignment : 64 2025-05-07T20:23:52.3291141Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:52.3291440Z power management: 2025-05-07T20:23:52.3291564Z 2025-05-07T20:23:52.3291647Z processor : 4 2025-05-07T20:23:52.3291844Z vendor_id : AuthenticAMD 2025-05-07T20:23:52.3292065Z cpu family : 23 2025-05-07T20:23:52.3292261Z model : 49 2025-05-07T20:23:52.3292454Z model name : AMD EPYC 7R32 2025-05-07T20:23:52.3292684Z stepping : 0 2025-05-07T20:23:52.3292884Z microcode : 0x830107f 2025-05-07T20:23:52.3293090Z cpu MHz : 3022.864 2025-05-07T20:23:52.3293294Z cache size : 512 KB 2025-05-07T20:23:52.3293499Z physical id : 0 2025-05-07T20:23:52.3293691Z siblings : 16 2025-05-07T20:23:52.3293880Z core id : 4 2025-05-07T20:23:52.3294066Z cpu cores : 8 2025-05-07T20:23:52.3294249Z apicid : 8 2025-05-07T20:23:52.3294432Z initial apicid : 8 2025-05-07T20:23:52.3294637Z fpu : yes 2025-05-07T20:23:52.3294876Z fpu_exception : yes 2025-05-07T20:23:52.3295089Z cpuid level : 13 2025-05-07T20:23:52.3295289Z wp : yes 2025-05-07T20:23:52.3297260Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:52.3299415Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:52.3299887Z bogomips : 5598.98 2025-05-07T20:23:52.3300094Z TLB size : 3072 4K pages 2025-05-07T20:23:52.3300316Z clflush size : 64 2025-05-07T20:23:52.3300518Z cache_alignment : 64 2025-05-07T20:23:52.3300926Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:52.3301224Z power management: 2025-05-07T20:23:52.3301351Z 2025-05-07T20:23:52.3301432Z processor : 5 2025-05-07T20:23:52.3301635Z vendor_id : AuthenticAMD 2025-05-07T20:23:52.3301860Z cpu family : 23 2025-05-07T20:23:52.3302045Z model : 49 2025-05-07T20:23:52.3302234Z model name : AMD EPYC 7R32 2025-05-07T20:23:52.3302467Z stepping : 0 2025-05-07T20:23:52.3302660Z microcode : 0x830107f 2025-05-07T20:23:52.3302874Z cpu MHz : 2590.152 2025-05-07T20:23:52.3303076Z cache size : 512 KB 2025-05-07T20:23:52.3303276Z physical id : 0 2025-05-07T20:23:52.3303475Z siblings : 16 2025-05-07T20:23:52.3303661Z core id : 5 2025-05-07T20:23:52.3303838Z cpu cores : 8 2025-05-07T20:23:52.3304023Z apicid : 10 2025-05-07T20:23:52.3304215Z initial apicid : 10 2025-05-07T20:23:52.3304412Z fpu : yes 2025-05-07T20:23:52.3304590Z fpu_exception : yes 2025-05-07T20:23:52.3304793Z cpuid level : 13 2025-05-07T20:23:52.3304989Z wp : yes 2025-05-07T20:23:52.3306869Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx 
mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:52.3309095Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:52.3309564Z bogomips : 5598.98 2025-05-07T20:23:52.3309775Z TLB size : 3072 4K pages 2025-05-07T20:23:52.3309992Z clflush size : 64 2025-05-07T20:23:52.3310203Z cache_alignment : 64 2025-05-07T20:23:52.3310461Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:52.3310758Z power management: 2025-05-07T20:23:52.3310891Z 2025-05-07T20:23:52.3310967Z processor : 6 2025-05-07T20:23:52.3311173Z vendor_id : AuthenticAMD 2025-05-07T20:23:52.3311399Z cpu family : 23 2025-05-07T20:23:52.3311587Z model : 49 2025-05-07T20:23:52.3311778Z model name : AMD EPYC 7R32 2025-05-07T20:23:52.3312004Z stepping : 0 2025-05-07T20:23:52.3312191Z microcode : 0x830107f 2025-05-07T20:23:52.3312405Z cpu MHz : 2021.248 2025-05-07T20:23:52.3312607Z cache size : 512 KB 2025-05-07T20:23:52.3312803Z physical id : 0 2025-05-07T20:23:52.3312994Z siblings : 16 2025-05-07T20:23:52.3313184Z core id : 6 2025-05-07T20:23:52.3313358Z cpu cores : 8 2025-05-07T20:23:52.3313550Z apicid : 12 2025-05-07T20:23:52.3313742Z initial apicid : 12 2025-05-07T20:23:52.3313936Z fpu : yes 2025-05-07T20:23:52.3314119Z fpu_exception : yes 2025-05-07T20:23:52.3314323Z cpuid level : 13 2025-05-07T20:23:52.3314514Z wp : yes 2025-05-07T20:23:52.3316479Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:52.3318652Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:52.3319121Z bogomips : 5598.98 2025-05-07T20:23:52.3319321Z TLB size : 3072 4K pages 2025-05-07T20:23:52.3319544Z clflush size : 64 2025-05-07T20:23:52.3319748Z cache_alignment : 64 2025-05-07T20:23:52.3320007Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:52.3320297Z power management: 2025-05-07T20:23:52.3320509Z 2025-05-07T20:23:52.3320587Z processor : 7 2025-05-07T20:23:52.3320794Z vendor_id : AuthenticAMD 2025-05-07T20:23:52.3321015Z cpu family : 23 2025-05-07T20:23:52.3321210Z model : 49 2025-05-07T20:23:52.3321402Z model name : AMD EPYC 7R32 2025-05-07T20:23:52.3321622Z stepping : 0 2025-05-07T20:23:52.3321813Z microcode : 0x830107f 2025-05-07T20:23:52.3322020Z cpu MHz : 3298.161 2025-05-07T20:23:52.3322220Z cache size : 512 KB 2025-05-07T20:23:52.3322425Z physical id : 0 2025-05-07T20:23:52.3322619Z siblings : 16 2025-05-07T20:23:52.3322801Z core id : 7 2025-05-07T20:23:52.3322988Z cpu cores : 8 2025-05-07T20:23:52.3323173Z apicid : 
14 2025-05-07T20:23:52.3323356Z initial apicid : 14 2025-05-07T20:23:52.3323554Z fpu : yes 2025-05-07T20:23:52.3323739Z fpu_exception : yes 2025-05-07T20:23:52.3323938Z cpuid level : 13 2025-05-07T20:23:52.3324138Z wp : yes 2025-05-07T20:23:52.3326026Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:52.3328232Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:52.3328698Z bogomips : 5598.98 2025-05-07T20:23:52.3328900Z TLB size : 3072 4K pages 2025-05-07T20:23:52.3329119Z clflush size : 64 2025-05-07T20:23:52.3329323Z cache_alignment : 64 2025-05-07T20:23:52.3329576Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:52.3329871Z power management: 2025-05-07T20:23:52.3329998Z 2025-05-07T20:23:52.3330078Z processor : 8 2025-05-07T20:23:52.3330273Z vendor_id : AuthenticAMD 2025-05-07T20:23:52.3330504Z cpu family : 23 2025-05-07T20:23:52.3330701Z model : 49 2025-05-07T20:23:52.3330890Z model name : AMD EPYC 7R32 2025-05-07T20:23:52.3331117Z stepping : 0 2025-05-07T20:23:52.3331317Z microcode : 0x830107f 2025-05-07T20:23:52.3331525Z cpu MHz : 2903.663 2025-05-07T20:23:52.3331729Z cache size : 512 KB 2025-05-07T20:23:52.3331928Z physical id : 0 2025-05-07T20:23:52.3332122Z siblings : 16 2025-05-07T20:23:52.3332312Z core id : 0 2025-05-07T20:23:52.3332495Z cpu cores : 8 2025-05-07T20:23:52.3332682Z apicid : 1 2025-05-07T20:23:52.3332861Z initial apicid : 1 2025-05-07T20:23:52.3333061Z fpu : yes 2025-05-07T20:23:52.3333244Z fpu_exception : yes 2025-05-07T20:23:52.3333442Z cpuid level : 13 2025-05-07T20:23:52.3333650Z wp : yes 2025-05-07T20:23:52.3335532Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:52.3337793Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:52.3338285Z bogomips : 5598.98 2025-05-07T20:23:52.3338490Z TLB size : 3072 4K pages 2025-05-07T20:23:52.3338703Z clflush size : 64 2025-05-07T20:23:52.3338909Z cache_alignment : 64 2025-05-07T20:23:52.3339163Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:52.3339453Z power management: 2025-05-07T20:23:52.3339584Z 2025-05-07T20:23:52.3339662Z processor : 9 2025-05-07T20:23:52.3339857Z vendor_id : AuthenticAMD 2025-05-07T20:23:52.3340337Z cpu family : 23 2025-05-07T20:23:52.3340818Z model : 49 2025-05-07T20:23:52.3341106Z model name : AMD EPYC 7R32 2025-05-07T20:23:52.3341419Z 
stepping : 0 2025-05-07T20:23:52.3341614Z microcode : 0x830107f 2025-05-07T20:23:52.3341825Z cpu MHz : 3119.159 2025-05-07T20:23:52.3342020Z cache size : 512 KB 2025-05-07T20:23:52.3342224Z physical id : 0 2025-05-07T20:23:52.3342419Z siblings : 16 2025-05-07T20:23:52.3342604Z core id : 1 2025-05-07T20:23:52.3342790Z cpu cores : 8 2025-05-07T20:23:52.3342974Z apicid : 3 2025-05-07T20:23:52.3343164Z initial apicid : 3 2025-05-07T20:23:52.3343360Z fpu : yes 2025-05-07T20:23:52.3343537Z fpu_exception : yes 2025-05-07T20:23:52.3343739Z cpuid level : 13 2025-05-07T20:23:52.3343930Z wp : yes 2025-05-07T20:23:52.3345941Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:52.3348202Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:52.3348663Z bogomips : 5598.98 2025-05-07T20:23:52.3348871Z TLB size : 3072 4K pages 2025-05-07T20:23:52.3349098Z clflush size : 64 2025-05-07T20:23:52.3349302Z cache_alignment : 64 2025-05-07T20:23:52.3349566Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:52.3349893Z power management: 2025-05-07T20:23:52.3350040Z 2025-05-07T20:23:52.3350120Z processor : 10 2025-05-07T20:23:52.3350334Z vendor_id : AuthenticAMD 2025-05-07T20:23:52.3350562Z cpu family : 23 2025-05-07T20:23:52.3350754Z model : 49 2025-05-07T20:23:52.3350952Z model name : AMD EPYC 7R32 2025-05-07T20:23:52.3351188Z stepping : 0 2025-05-07T20:23:52.3351376Z microcode : 0x830107f 2025-05-07T20:23:52.3351591Z cpu MHz : 3237.299 2025-05-07T20:23:52.3351793Z cache size : 512 KB 2025-05-07T20:23:52.3351997Z physical id : 0 2025-05-07T20:23:52.3352196Z siblings : 16 2025-05-07T20:23:52.3352386Z core id : 2 2025-05-07T20:23:52.3352567Z cpu cores : 8 2025-05-07T20:23:52.3352756Z apicid : 5 2025-05-07T20:23:52.3352944Z initial apicid : 5 2025-05-07T20:23:52.3353136Z fpu : yes 2025-05-07T20:23:52.3353323Z fpu_exception : yes 2025-05-07T20:23:52.3353532Z cpuid level : 13 2025-05-07T20:23:52.3353731Z wp : yes 2025-05-07T20:23:52.3355602Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:52.3357756Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:52.3358222Z bogomips : 5598.98 2025-05-07T20:23:52.3358554Z TLB size : 3072 4K pages 2025-05-07T20:23:52.3358771Z clflush size : 64 2025-05-07T20:23:52.3358981Z cache_alignment : 64 2025-05-07T20:23:52.3359238Z address sizes : 48 bits 
physical, 48 bits virtual 2025-05-07T20:23:52.3359531Z power management: 2025-05-07T20:23:52.3359662Z 2025-05-07T20:23:52.3359742Z processor : 11 2025-05-07T20:23:52.3359951Z vendor_id : AuthenticAMD 2025-05-07T20:23:52.3360175Z cpu family : 23 2025-05-07T20:23:52.3360365Z model : 49 2025-05-07T20:23:52.3360560Z model name : AMD EPYC 7R32 2025-05-07T20:23:52.3360788Z stepping : 0 2025-05-07T20:23:52.3361046Z microcode : 0x830107f 2025-05-07T20:23:52.3361259Z cpu MHz : 3237.805 2025-05-07T20:23:52.3361466Z cache size : 512 KB 2025-05-07T20:23:52.3361665Z physical id : 0 2025-05-07T20:23:52.3361866Z siblings : 16 2025-05-07T20:23:52.3362061Z core id : 3 2025-05-07T20:23:52.3362244Z cpu cores : 8 2025-05-07T20:23:52.3362489Z apicid : 7 2025-05-07T20:23:52.3362761Z initial apicid : 7 2025-05-07T20:23:52.3362968Z fpu : yes 2025-05-07T20:23:52.3363159Z fpu_exception : yes 2025-05-07T20:23:52.3363366Z cpuid level : 13 2025-05-07T20:23:52.3363558Z wp : yes 2025-05-07T20:23:52.3365554Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:52.3367731Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:52.3368201Z bogomips : 5598.98 2025-05-07T20:23:52.3368406Z TLB size : 3072 4K pages 2025-05-07T20:23:52.3368635Z clflush size : 64 2025-05-07T20:23:52.3368841Z cache_alignment : 64 2025-05-07T20:23:52.3369099Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:52.3369391Z power management: 2025-05-07T20:23:52.3369527Z 2025-05-07T20:23:52.3369607Z processor : 12 2025-05-07T20:23:52.3369810Z vendor_id : AuthenticAMD 2025-05-07T20:23:52.3370027Z cpu family : 23 2025-05-07T20:23:52.3370221Z model : 49 2025-05-07T20:23:52.3370412Z model name : AMD EPYC 7R32 2025-05-07T20:23:52.3370630Z stepping : 0 2025-05-07T20:23:52.3370826Z microcode : 0x830107f 2025-05-07T20:23:52.3371037Z cpu MHz : 3102.093 2025-05-07T20:23:52.3371241Z cache size : 512 KB 2025-05-07T20:23:52.3371446Z physical id : 0 2025-05-07T20:23:52.3371644Z siblings : 16 2025-05-07T20:23:52.3371825Z core id : 4 2025-05-07T20:23:52.3372014Z cpu cores : 8 2025-05-07T20:23:52.3372203Z apicid : 9 2025-05-07T20:23:52.3372382Z initial apicid : 9 2025-05-07T20:23:52.3372590Z fpu : yes 2025-05-07T20:23:52.3372775Z fpu_exception : yes 2025-05-07T20:23:52.3372978Z cpuid level : 13 2025-05-07T20:23:52.3373207Z wp : yes 2025-05-07T20:23:52.3375361Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 
2025-05-07T20:23:52.3377525Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:52.3377994Z bogomips : 5598.98 2025-05-07T20:23:52.3378200Z TLB size : 3072 4K pages 2025-05-07T20:23:52.3378426Z clflush size : 64 2025-05-07T20:23:52.3378631Z cache_alignment : 64 2025-05-07T20:23:52.3378992Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:52.3379297Z power management: 2025-05-07T20:23:52.3379421Z 2025-05-07T20:23:52.3379510Z processor : 13 2025-05-07T20:23:52.3379713Z vendor_id : AuthenticAMD 2025-05-07T20:23:52.3379941Z cpu family : 23 2025-05-07T20:23:52.3380141Z model : 49 2025-05-07T20:23:52.3380332Z model name : AMD EPYC 7R32 2025-05-07T20:23:52.3380563Z stepping : 0 2025-05-07T20:23:52.3380762Z microcode : 0x830107f 2025-05-07T20:23:52.3380975Z cpu MHz : 2431.197 2025-05-07T20:23:52.3381184Z cache size : 512 KB 2025-05-07T20:23:52.3381393Z physical id : 0 2025-05-07T20:23:52.3381666Z siblings : 16 2025-05-07T20:23:52.3381859Z core id : 5 2025-05-07T20:23:52.3382052Z cpu cores : 8 2025-05-07T20:23:52.3382237Z apicid : 11 2025-05-07T20:23:52.3382433Z initial apicid : 11 2025-05-07T20:23:52.3382638Z fpu : yes 2025-05-07T20:23:52.3382825Z fpu_exception : yes 2025-05-07T20:23:52.3383028Z cpuid level : 13 2025-05-07T20:23:52.3383229Z wp : yes 2025-05-07T20:23:52.3385120Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:52.3387292Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:52.3387839Z bogomips : 5598.98 2025-05-07T20:23:52.3388049Z TLB size : 3072 4K pages 2025-05-07T20:23:52.3388274Z clflush size : 64 2025-05-07T20:23:52.3388474Z cache_alignment : 64 2025-05-07T20:23:52.3388734Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:52.3389037Z power management: 2025-05-07T20:23:52.3389162Z 2025-05-07T20:23:52.3389240Z processor : 14 2025-05-07T20:23:52.3389452Z vendor_id : AuthenticAMD 2025-05-07T20:23:52.3389680Z cpu family : 23 2025-05-07T20:23:52.3389877Z model : 49 2025-05-07T20:23:52.3390066Z model name : AMD EPYC 7R32 2025-05-07T20:23:52.3390296Z stepping : 0 2025-05-07T20:23:52.3390492Z microcode : 0x830107f 2025-05-07T20:23:52.3390704Z cpu MHz : 3064.028 2025-05-07T20:23:52.3390913Z cache size : 512 KB 2025-05-07T20:23:52.3391119Z physical id : 0 2025-05-07T20:23:52.3391314Z siblings : 16 2025-05-07T20:23:52.3391513Z core id : 6 2025-05-07T20:23:52.3391702Z cpu cores : 8 2025-05-07T20:23:52.3391887Z apicid : 13 2025-05-07T20:23:52.3392082Z initial apicid : 13 2025-05-07T20:23:52.3392284Z fpu : yes 2025-05-07T20:23:52.3392465Z fpu_exception : yes 2025-05-07T20:23:52.3392672Z cpuid level : 13 2025-05-07T20:23:52.3392875Z wp : yes 2025-05-07T20:23:52.3394760Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid 
extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:52.3396909Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:52.3397385Z bogomips : 5598.98 2025-05-07T20:23:52.3397597Z TLB size : 3072 4K pages 2025-05-07T20:23:52.3397819Z clflush size : 64 2025-05-07T20:23:52.3398028Z cache_alignment : 64 2025-05-07T20:23:52.3398293Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:52.3398600Z power management: 2025-05-07T20:23:52.3398728Z 2025-05-07T20:23:52.3398895Z processor : 15 2025-05-07T20:23:52.3399115Z vendor_id : AuthenticAMD 2025-05-07T20:23:52.3399345Z cpu family : 23 2025-05-07T20:23:52.3399534Z model : 49 2025-05-07T20:23:52.3399737Z model name : AMD EPYC 7R32 2025-05-07T20:23:52.3399968Z stepping : 0 2025-05-07T20:23:52.3400161Z microcode : 0x830107f 2025-05-07T20:23:52.3400377Z cpu MHz : 3282.272 2025-05-07T20:23:52.3400587Z cache size : 512 KB 2025-05-07T20:23:52.3400786Z physical id : 0 2025-05-07T20:23:52.3400990Z siblings : 16 2025-05-07T20:23:52.3401186Z core id : 7 2025-05-07T20:23:52.3401372Z cpu cores : 8 2025-05-07T20:23:52.3401635Z apicid : 15 2025-05-07T20:23:52.3401830Z initial apicid : 15 2025-05-07T20:23:52.3402028Z fpu : yes 2025-05-07T20:23:52.3402215Z fpu_exception : yes 2025-05-07T20:23:52.3402423Z cpuid level : 13 2025-05-07T20:23:52.3402615Z wp : yes 2025-05-07T20:23:52.3404502Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:23:52.3406665Z bugs : sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass retbleed smt_rsb srso ibpb_no_ret 2025-05-07T20:23:52.3407138Z bogomips : 5598.98 2025-05-07T20:23:52.3407348Z TLB size : 3072 4K pages 2025-05-07T20:23:52.3407565Z clflush size : 64 2025-05-07T20:23:52.3407775Z cache_alignment : 64 2025-05-07T20:23:52.3408039Z address sizes : 48 bits physical, 48 bits virtual 2025-05-07T20:23:52.3408337Z power management: 2025-05-07T20:23:52.3408469Z 2025-05-07T20:23:52.3408473Z 2025-05-07T20:23:52.3408584Z ################################################################################ 2025-05-07T20:23:52.3408880Z [INFO] Print PCI info ... 2025-05-07T20:23:52.3409108Z + lspci -v 2025-05-07T20:23:52.3409224Z 2025-05-07T20:23:52.3409440Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] 2025-05-07T20:23:52.3409806Z Subsystem: Amazon.com, Inc. 
Device 1237 2025-05-07T20:23:52.3410115Z Flags: bus master, medium devsel, latency 0 2025-05-07T20:23:52.3410315Z 2025-05-07T20:23:52.3410512Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2025-05-07T20:23:52.3410875Z Physical Slot: 1 2025-05-07T20:23:52.3411114Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:52.3411310Z 2025-05-07T20:23:52.3411555Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2025-05-07T20:23:52.3411968Z Physical Slot: 1 2025-05-07T20:23:52.3412218Z Flags: bus master, fast devsel, latency 0, IRQ 9 2025-05-07T20:23:52.3412441Z 2025-05-07T20:23:52.3412701Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 (prog-if 00 [VGA controller]) 2025-05-07T20:23:52.3413130Z Physical Slot: 3 2025-05-07T20:23:52.3419645Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:52.3420009Z Memory at c1000000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:52.3420360Z Expansion ROM at 000c0000 [disabled] [size=128K] 2025-05-07T20:23:52.3420574Z 2025-05-07T20:23:52.3420863Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:52.3421353Z Subsystem: Amazon.com, Inc. Device 0000 2025-05-07T20:23:52.3421636Z Physical Slot: 4 2025-05-07T20:23:52.3421881Z Flags: bus master, fast devsel, latency 0, IRQ 11 2025-05-07T20:23:52.3422244Z Memory at c1808000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:52.3422586Z Capabilities: 2025-05-07T20:23:52.3422846Z Kernel driver in use: nvme 2025-05-07T20:23:52.3423001Z 2025-05-07T20:23:52.3423345Z 00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:52.3423809Z Subsystem: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-05-07T20:23:52.3424145Z Physical Slot: 5 2025-05-07T20:23:52.3424377Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:52.3424724Z Memory at c1804000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:52.3425097Z Memory at c1400000 (32-bit, prefetchable) [size=4M] 2025-05-07T20:23:52.3425403Z Capabilities: 2025-05-07T20:23:52.3425654Z Kernel driver in use: ena 2025-05-07T20:23:52.3425885Z Kernel modules: ena 2025-05-07T20:23:52.3426094Z 2025-05-07T20:23:52.3426262Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:52.3426621Z Subsystem: NVIDIA Corporation Device 152f 2025-05-07T20:23:52.3426903Z Physical Slot: 30 2025-05-07T20:23:52.3427151Z Flags: bus master, fast devsel, latency 0, IRQ 10 2025-05-07T20:23:52.3427586Z Memory at c0000000 (32-bit, non-prefetchable) [size=16M] 2025-05-07T20:23:52.3427998Z Memory at 1800000000 (64-bit, prefetchable) [size=32G] 2025-05-07T20:23:52.3428363Z Memory at 1040000000 (64-bit, prefetchable) [size=32M] 2025-05-07T20:23:52.3428680Z Capabilities: 2025-05-07T20:23:52.3428936Z Kernel driver in use: nvidia 2025-05-07T20:23:52.3429180Z Kernel modules: nvidia 2025-05-07T20:23:52.3429319Z 2025-05-07T20:23:52.3429626Z 00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller (prog-if 02 [NVM Express]) 2025-05-07T20:23:52.3430110Z Subsystem: Amazon.com, Inc. 
Device 0000 2025-05-07T20:23:52.3430384Z Physical Slot: 31 2025-05-07T20:23:52.3430618Z Flags: bus master, fast devsel, latency 0 2025-05-07T20:23:52.3430956Z Memory at c1800000 (32-bit, non-prefetchable) [size=16K] 2025-05-07T20:23:52.3431325Z Memory at c180c000 (32-bit, prefetchable) [size=8K] 2025-05-07T20:23:52.3431639Z Capabilities: 2025-05-07T20:23:52.3431890Z Kernel driver in use: nvme 2025-05-07T20:23:52.3432047Z 2025-05-07T20:23:52.3432051Z 2025-05-07T20:23:52.3432162Z ################################################################################ 2025-05-07T20:23:52.3432473Z [INFO] Print Linux distribution info ... 2025-05-07T20:23:52.3432747Z + uname -a 2025-05-07T20:23:52.3432851Z 2025-05-07T20:23:52.3433241Z Linux ip-10-0-29-91.ec2.internal 6.1.130-139.222.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux 2025-05-07T20:23:52.3433721Z 2025-05-07T20:23:52.3433797Z + uname -m 2025-05-07T20:23:52.3433914Z 2025-05-07T20:23:52.3433986Z x86_64 2025-05-07T20:23:52.3434086Z 2025-05-07T20:23:52.3434177Z + cat /proc/version 2025-05-07T20:23:52.3434305Z 2025-05-07T20:23:52.3434826Z Linux version 6.1.130-139.222.amzn2023.x86_64 (mockbuild@ip-10-0-55-76) (gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), GNU ld version 2.39-6.amzn2023.0.11) #1 SMP PREEMPT_DYNAMIC Tue Mar 11 01:10:58 UTC 2025 2025-05-07T20:23:52.3435439Z 2025-05-07T20:23:52.3435524Z + cat /etc/os-release 2025-05-07T20:23:52.3435670Z 2025-05-07T20:23:52.3435754Z NAME="Amazon Linux" 2025-05-07T20:23:52.3435960Z VERSION="2023" 2025-05-07T20:23:52.3436156Z ID="amzn" 2025-05-07T20:23:52.3436332Z ID_LIKE="fedora" 2025-05-07T20:23:52.3436528Z VERSION_ID="2023" 2025-05-07T20:23:52.3436742Z PLATFORM_ID="platform:al2023" 2025-05-07T20:23:52.3437009Z PRETTY_NAME="Amazon Linux 2023.6.20250317" 2025-05-07T20:23:52.3437290Z ANSI_COLOR="0;33" 2025-05-07T20:23:52.3437525Z CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023" 2025-05-07T20:23:52.3437902Z HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/" 2025-05-07T20:23:52.3438327Z DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/" 2025-05-07T20:23:52.3438731Z SUPPORT_URL="https://aws.amazon.com/premiumsupport/" 2025-05-07T20:23:52.3439161Z BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023" 2025-05-07T20:23:52.3439521Z VENDOR_NAME="AWS" 2025-05-07T20:23:52.3439755Z VENDOR_URL="https://aws.amazon.com/" 2025-05-07T20:23:52.3440029Z SUPPORT_END="2029-06-30" 2025-05-07T20:23:52.3440468Z 2025-05-07T20:23:52.3440912Z ################################################################################ 2025-05-07T20:23:52.3441322Z # Print EC2 Instance Info 2025-05-07T20:23:52.3441546Z # 2025-05-07T20:23:52.3441745Z # [2025-05-07T20:23:52.339Z] + print_ec2_info 2025-05-07T20:23:52.3442049Z ################################################################################ 2025-05-07T20:23:52.3442250Z 2025-05-07T20:23:52.3521492Z ami-id: ami-071226ecf16aa7d96 2025-05-07T20:23:52.3636056Z instance-id: i-061cb0426579ace80 2025-05-07T20:23:52.3745412Z instance-type: g5.4xlarge 2025-05-07T20:23:52.3786618Z ##[group]Run . $PRELUDE; print_gpu_info 2025-05-07T20:23:52.3787136Z . 
$PRELUDE; print_gpu_info 2025-05-07T20:23:52.3796141Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:23:52.3796499Z env: 2025-05-07T20:23:52.3796717Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:23:52.3797028Z BUILD_ENV: build_binary 2025-05-07T20:23:52.3797277Z BUILD_TARGET: genai 2025-05-07T20:23:52.3797508Z BUILD_VARIANT: cuda 2025-05-07T20:23:52.3797735Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:23:52.3798002Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:23:52.3798301Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:23:52.3798631Z ##[endgroup] 2025-05-07T20:23:52.7138087Z ################################################################################ 2025-05-07T20:23:52.7138469Z [INFO] Printing general display info ... 2025-05-07T20:23:52.7171426Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:23:52.8328928Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:23:52.8339624Z /usr/bin/sudo 2025-05-07T20:23:52.8350640Z which: no apt-get in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin) 2025-05-07T20:23:52.8361195Z /usr/bin/yum 2025-05-07T20:23:52.8363144Z [INSTALL] Updating system repositories ... 2025-05-07T20:23:52.8385293Z [EXEC] [ATTEMPT 0/3] + sudo yum update -y 2025-05-07T20:23:53.2689085Z Last metadata expiration check: 0:00:07 ago on Wed May 7 20:23:46 2025. 2025-05-07T20:23:53.3453357Z ================================================================================ 2025-05-07T20:23:53.3453996Z WARNING: 2025-05-07T20:23:53.3454307Z A newer release of "Amazon Linux" is available. 2025-05-07T20:23:53.3454603Z 2025-05-07T20:23:53.3454695Z Available Versions: 2025-05-07T20:23:53.3454855Z 2025-05-07T20:23:53.3454978Z Version 2023.7.20250331: 2025-05-07T20:23:53.3455304Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:53.3455553Z 2025-05-07T20:23:53.3455697Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:53.3455908Z 2025-05-07T20:23:53.3455994Z Release notes: 2025-05-07T20:23:53.3456383Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:53.3456761Z 2025-05-07T20:23:53.3456847Z Version 2023.7.20250414: 2025-05-07T20:23:53.3457141Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:53.3457378Z 2025-05-07T20:23:53.3457485Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:53.3457688Z 2025-05-07T20:23:53.3457767Z Release notes: 2025-05-07T20:23:53.3458148Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:53.3458510Z 2025-05-07T20:23:53.3458599Z Version 2023.7.20250428: 2025-05-07T20:23:53.3458883Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:53.3459124Z 2025-05-07T20:23:53.3459232Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:53.3459437Z 2025-05-07T20:23:53.3459532Z Release notes: 2025-05-07T20:23:53.3459910Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:53.3460266Z 2025-05-07T20:23:53.3460372Z ================================================================================ 2025-05-07T20:23:53.4604931Z Dependencies resolved. 
2025-05-07T20:23:53.4891484Z ================================================================================ 2025-05-07T20:23:53.4891973Z Package Arch Version Repository Size 2025-05-07T20:23:53.4892517Z ================================================================================ 2025-05-07T20:23:53.4892814Z Upgrading: 2025-05-07T20:23:53.4893164Z nvidia-container-toolkit x86_64 1.17.6-1 nvidia-container-toolkit 1.2 M 2025-05-07T20:23:53.4893727Z nvidia-container-toolkit-base x86_64 1.17.6-1 nvidia-container-toolkit 5.7 M 2025-05-07T20:23:53.4894080Z 2025-05-07T20:23:53.4894432Z Transaction Summary 2025-05-07T20:23:53.4894825Z ================================================================================ 2025-05-07T20:23:53.4895127Z Upgrade 2 Packages 2025-05-07T20:23:53.4895267Z 2025-05-07T20:23:53.4895370Z Total download size: 6.9 M 2025-05-07T20:23:53.4895963Z Downloading Packages: 2025-05-07T20:23:53.5419714Z (1/2): nvidia-container-toolkit-1.17.6-1.x86_64 24 MB/s | 1.2 MB 00:00 2025-05-07T20:23:53.5683424Z (2/2): nvidia-container-toolkit-base-1.17.6-1.x 73 MB/s | 5.7 MB 00:00 2025-05-07T20:23:53.5691576Z -------------------------------------------------------------------------------- 2025-05-07T20:23:53.5694566Z Total 87 MB/s | 6.9 MB 00:00 2025-05-07T20:23:53.5696891Z Running transaction check 2025-05-07T20:23:53.5793593Z Transaction check succeeded. 2025-05-07T20:23:53.5794217Z Running transaction test 2025-05-07T20:23:53.6087966Z Transaction test succeeded. 2025-05-07T20:23:53.6090879Z Running transaction 2025-05-07T20:23:54.1577044Z Preparing : 1/1 2025-05-07T20:23:54.2636159Z Upgrading : nvidia-container-toolkit-base-1.17.6-1.x86_64 1/4 2025-05-07T20:23:54.2656416Z Upgrading : nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:54.2864269Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 2/4 2025-05-07T20:23:54.2865117Z Cleanup : nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:54.2965693Z Running scriptlet: nvidia-container-toolkit-1.16.2-1.x86_64 3/4 2025-05-07T20:23:54.2987940Z Cleanup : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:54.4366580Z Running scriptlet: nvidia-container-toolkit-1.17.6-1.x86_64 4/4 2025-05-07T20:23:54.4367370Z Verifying : nvidia-container-toolkit-1.17.6-1.x86_64 1/4 2025-05-07T20:23:54.4368007Z Verifying : nvidia-container-toolkit-1.16.2-1.x86_64 2/4 2025-05-07T20:23:54.4368547Z Verifying : nvidia-container-toolkit-base-1.17.6-1.x86_64 3/4 2025-05-07T20:23:54.5867733Z ================================================================================ 2025-05-07T20:23:54.5868295Z WARNING: 2025-05-07T20:23:54.5868630Z A newer release of "Amazon Linux" is available. 
2025-05-07T20:23:54.5868937Z 2025-05-07T20:23:54.5869072Z Available Versions: 2025-05-07T20:23:54.5869275Z 2025-05-07T20:23:54.5869372Z Version 2023.7.20250331: 2025-05-07T20:23:54.5869681Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:54.5869934Z 2025-05-07T20:23:54.5870064Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:54.5870269Z 2025-05-07T20:23:54.5870360Z Release notes: 2025-05-07T20:23:54.5870751Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:54.5871128Z 2025-05-07T20:23:54.5871243Z Version 2023.7.20250414: 2025-05-07T20:23:54.5871552Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:54.5871791Z 2025-05-07T20:23:54.5871899Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:54.5872107Z 2025-05-07T20:23:54.5872192Z Release notes: 2025-05-07T20:23:54.5872578Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:54.5872945Z 2025-05-07T20:23:54.5873038Z Version 2023.7.20250428: 2025-05-07T20:23:54.5873328Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:54.5873569Z 2025-05-07T20:23:54.5873679Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:54.5873882Z 2025-05-07T20:23:54.5873971Z Release notes: 2025-05-07T20:23:54.5874343Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:54.5874717Z 2025-05-07T20:23:54.5875151Z ================================================================================ 2025-05-07T20:23:54.6447884Z Verifying : nvidia-container-toolkit-base-1.16.2-1.x86_64 4/4 2025-05-07T20:23:54.6448338Z 2025-05-07T20:23:54.6448461Z Upgraded: 2025-05-07T20:23:54.6448904Z nvidia-container-toolkit-1.17.6-1.x86_64 2025-05-07T20:23:54.6449672Z nvidia-container-toolkit-base-1.17.6-1.x86_64 2025-05-07T20:23:54.6450147Z 2025-05-07T20:23:54.6450256Z Complete! 2025-05-07T20:23:54.6887012Z [INSTALL] Installing system package(s): hostname lshw ... 2025-05-07T20:23:54.6909548Z [EXEC] [ATTEMPT 0/3] + sudo yum install -y hostname lshw 2025-05-07T20:23:55.1132551Z Last metadata expiration check: 0:00:09 ago on Wed May 7 20:23:46 2025. 2025-05-07T20:23:55.1375323Z Package hostname-3.23-4.amzn2023.0.3.x86_64 is already installed. 2025-05-07T20:23:55.1777261Z Dependencies resolved. 
2025-05-07T20:23:55.1954848Z ================================================================================ 2025-05-07T20:23:55.1955519Z Package Architecture Version Repository Size 2025-05-07T20:23:55.1956089Z ================================================================================ 2025-05-07T20:23:55.1956430Z Installing: 2025-05-07T20:23:55.1956712Z lshw x86_64 B.02.19.2-7.amzn2023.0.3 amazonlinux 319 k 2025-05-07T20:23:55.1956972Z 2025-05-07T20:23:55.1957063Z Transaction Summary 2025-05-07T20:23:55.1957333Z ================================================================================ 2025-05-07T20:23:55.1957752Z Install 1 Package 2025-05-07T20:23:55.1957933Z 2025-05-07T20:23:55.1958070Z Total download size: 319 k 2025-05-07T20:23:55.1958393Z Installed size: 837 k 2025-05-07T20:23:55.1959149Z Downloading Packages: 2025-05-07T20:23:55.2775621Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64.rpm 6.6 MB/s | 319 kB 00:00 2025-05-07T20:23:55.2781228Z -------------------------------------------------------------------------------- 2025-05-07T20:23:55.2783952Z Total 3.8 MB/s | 319 kB 00:00 2025-05-07T20:23:55.2937344Z Running transaction check 2025-05-07T20:23:55.2992767Z Transaction check succeeded. 2025-05-07T20:23:55.2993144Z Running transaction test 2025-05-07T20:23:55.3454931Z Transaction test succeeded. 2025-05-07T20:23:55.3458760Z Running transaction 2025-05-07T20:23:55.4511503Z Preparing : 1/1 2025-05-07T20:23:55.5048941Z Installing : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:55.6767657Z Running scriptlet: lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:55.8136849Z ================================================================================ 2025-05-07T20:23:55.8137341Z WARNING: 2025-05-07T20:23:55.8137649Z A newer release of "Amazon Linux" is available. 
2025-05-07T20:23:55.8137956Z 2025-05-07T20:23:55.8138074Z Available Versions: 2025-05-07T20:23:55.8138312Z 2025-05-07T20:23:55.8138414Z Version 2023.7.20250331: 2025-05-07T20:23:55.8138716Z Run the following command to upgrade to 2023.7.20250331: 2025-05-07T20:23:55.8138966Z 2025-05-07T20:23:55.8139086Z dnf upgrade --releasever=2023.7.20250331 2025-05-07T20:23:55.8139298Z 2025-05-07T20:23:55.8139379Z Release notes: 2025-05-07T20:23:55.8139778Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250331.html 2025-05-07T20:23:55.8140430Z 2025-05-07T20:23:55.8140541Z Version 2023.7.20250414: 2025-05-07T20:23:55.8140841Z Run the following command to upgrade to 2023.7.20250414: 2025-05-07T20:23:55.8141091Z 2025-05-07T20:23:55.8141202Z dnf upgrade --releasever=2023.7.20250414 2025-05-07T20:23:55.8141400Z 2025-05-07T20:23:55.8141490Z Release notes: 2025-05-07T20:23:55.8141868Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250414.html 2025-05-07T20:23:55.8142243Z 2025-05-07T20:23:55.8142623Z Version 2023.7.20250428: 2025-05-07T20:23:55.8143056Z Run the following command to upgrade to 2023.7.20250428: 2025-05-07T20:23:55.8143303Z 2025-05-07T20:23:55.8143419Z dnf upgrade --releasever=2023.7.20250428 2025-05-07T20:23:55.8143616Z 2025-05-07T20:23:55.8143701Z Release notes: 2025-05-07T20:23:55.8144091Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.7.20250428.html 2025-05-07T20:23:55.8144442Z 2025-05-07T20:23:55.8144566Z ================================================================================ 2025-05-07T20:23:55.8482079Z Verifying : lshw-B.02.19.2-7.amzn2023.0.3.x86_64 1/1 2025-05-07T20:23:55.8482513Z 2025-05-07T20:23:55.8482625Z Installed: 2025-05-07T20:23:55.8483031Z lshw-B.02.19.2-7.amzn2023.0.3.x86_64 2025-05-07T20:23:55.8483418Z 2025-05-07T20:23:55.8483516Z Complete! 2025-05-07T20:23:55.8942813Z + hostname 2025-05-07T20:23:55.8943006Z 2025-05-07T20:23:55.8956139Z ip-10-0-29-91.ec2.internal 2025-05-07T20:23:55.8957133Z 2025-05-07T20:23:55.8957805Z + sudo lshw -C display 2025-05-07T20:23:55.8958006Z 2025-05-07T20:23:56.3239381Z *-display:0 UNCLAIMED 2025-05-07T20:23:56.3239712Z description: VGA compatible controller 2025-05-07T20:23:56.3240023Z product: Amazon.com, Inc. 2025-05-07T20:23:56.3240529Z vendor: Amazon.com, Inc. 
2025-05-07T20:23:56.3240805Z physical id: 3 2025-05-07T20:23:56.3241060Z bus info: pci@0000:00:03.0 2025-05-07T20:23:56.3241311Z version: 00 2025-05-07T20:23:56.3241515Z width: 32 bits 2025-05-07T20:23:56.3241723Z clock: 33MHz 2025-05-07T20:23:56.3241964Z capabilities: vga_controller bus_master 2025-05-07T20:23:56.3242270Z configuration: latency=0 2025-05-07T20:23:56.3242588Z resources: memory:c1000000-c13fffff memory:c0000-dffff 2025-05-07T20:23:56.3242904Z *-display:1 2025-05-07T20:23:56.3243121Z description: 3D controller 2025-05-07T20:23:56.3243413Z product: GA102GL [A10G] 2025-05-07T20:23:56.3243665Z vendor: NVIDIA Corporation 2025-05-07T20:23:56.3243929Z physical id: 1e 2025-05-07T20:23:56.3244159Z bus info: pci@0000:00:1e.0 2025-05-07T20:23:56.3244402Z version: a1 2025-05-07T20:23:56.3244608Z width: 64 bits 2025-05-07T20:23:56.3244823Z clock: 33MHz 2025-05-07T20:23:56.3245101Z capabilities: pm pciexpress msix bus_master cap_list 2025-05-07T20:23:56.3245465Z configuration: driver=nvidia latency=0 2025-05-07T20:23:56.3246082Z resources: iomemory:180-17f iomemory:100-ff irq:10 memory:c0000000-c0ffffff memory:1800000000-1fffffffff memory:1040000000-1041ffffff 2025-05-07T20:23:56.3279137Z 2025-05-07T20:23:56.3279349Z ################################################################################ 2025-05-07T20:23:56.3279659Z [INFO] Printing NVIDIA GPU info ... 2025-05-07T20:23:56.3410000Z 00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1) 2025-05-07T20:23:56.3577465Z Wed May 7 20:23:56 2025 2025-05-07T20:23:56.3586244Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:56.3586951Z | NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 | 2025-05-07T20:23:56.3587426Z |-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:56.3587989Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-05-07T20:23:56.3588502Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-05-07T20:23:56.3588922Z | | | MIG M. | 2025-05-07T20:23:56.3589241Z |=========================================+========================+======================| 2025-05-07T20:23:56.3657426Z | 0 NVIDIA A10G On | 00000000:00:1E.0 Off | 0 | 2025-05-07T20:23:56.3658202Z | 0% 29C P0 61W / 300W | 0MiB / 23028MiB | 0% Default | 2025-05-07T20:23:56.3658714Z | | | N/A | 2025-05-07T20:23:56.3659091Z +-----------------------------------------+------------------------+----------------------+ 2025-05-07T20:23:56.3659476Z 2025-05-07T20:23:56.3660009Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:56.3660431Z | Processes: | 2025-05-07T20:23:56.3660889Z | GPU GI CI PID Type Process name GPU Memory | 2025-05-07T20:23:56.3661302Z | ID ID Usage | 2025-05-07T20:23:56.3661647Z |=========================================================================================| 2025-05-07T20:23:56.3662600Z | No running processes found | 2025-05-07T20:23:56.3663058Z +-----------------------------------------------------------------------------------------+ 2025-05-07T20:23:56.5073529Z ################################################################################ 2025-05-07T20:23:56.5073877Z [INFO] Printing AMD GPU info ... 
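[NOTE] This job runs on an NVIDIA A10G, so the ROCm probes below are expected to fail; the script checks for both vendors regardless of the build variant. A minimal sketch of that probe pattern (the tool list and messages are illustrative, not the actual setup_env.bash code):

  # Probe for each vendor's GPU tooling (the log shows `which` being used
  # for this) and report what is present.
  for tool in nvidia-smi rocminfo rocm-smi; do
    if which "$tool" >/dev/null 2>&1; then
      echo "[CHECK] $tool found at $(which "$tool")"
    else
      echo "[CHECK] $tool not found"
    fi
  done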
2025-05-07T20:23:56.5073529Z ################################################################################
2025-05-07T20:23:56.5073877Z [INFO] Printing AMD GPU info ...
2025-05-07T20:23:56.5220507Z which: no rocminfo in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin)
2025-05-07T20:23:56.5221457Z [CHECK] rocminfo not found
2025-05-07T20:23:56.5230412Z which: no rocm-smi in (/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin)
2025-05-07T20:23:56.5231342Z [CHECK] rocm-smi not found
2025-05-07T20:23:56.5265910Z ##[group]Run . $PRELUDE; setup_miniconda $HOME/miniconda
2025-05-07T20:23:56.5266334Z . $PRELUDE; setup_miniconda $HOME/miniconda
2025-05-07T20:23:56.5279116Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:23:56.5279474Z env:
2025-05-07T20:23:56.5279694Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:23:56.5280011Z BUILD_ENV: build_binary
2025-05-07T20:23:56.5280260Z BUILD_TARGET: genai
2025-05-07T20:23:56.5280490Z BUILD_VARIANT: cuda
2025-05-07T20:23:56.5280719Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:23:56.5280998Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:23:56.5281318Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:23:56.5281637Z ##[endgroup]
2025-05-07T20:23:56.8608916Z ################################################################################
2025-05-07T20:23:56.8609275Z # Setup Miniconda
2025-05-07T20:23:56.8609478Z #
2025-05-07T20:23:56.8623894Z # [2025-05-07T20:23:56.862Z] + setup_miniconda /home/ec2-user/miniconda
2025-05-07T20:23:56.8624296Z ################################################################################
2025-05-07T20:23:56.8638946Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:23:56.9548079Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:23:56.9548444Z + mkdir -p /home/ec2-user/miniconda
2025-05-07T20:23:56.9564060Z [SETUP] Downloading the Miniconda installer ...
2025-05-07T20:23:56.9585023Z [EXEC] [ATTEMPT 0/3] + wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
2025-05-07T20:23:58.5673204Z [SETUP] Installing Miniconda ...
2025-05-07T20:23:58.5673583Z + bash miniconda.sh -b -p /home/ec2-user/miniconda -u
2025-05-07T20:23:58.5817173Z PREFIX=/home/ec2-user/miniconda
2025-05-07T20:23:59.0309017Z Unpacking payload ...
2025-05-07T20:23:59.5471727Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
2025-05-07T20:24:00.3442989Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
2025-05-07T20:24:02.4528379Z Installing base environment...
2025-05-07T20:24:03.5321112Z Preparing transaction: ...working... done
2025-05-07T20:24:06.5147755Z Executing transaction: ...working... done
2025-05-07T20:24:07.1721311Z entry_point.py:256: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
2025-05-07T20:24:07.2599645Z installation finished.
2025-05-07T20:24:07.2608627Z + rm -f miniconda.sh
2025-05-07T20:24:07.2914292Z [SETUP] Reloading the bash configuration ...
2025-05-07T20:24:07.2914650Z + /home/ec2-user/miniconda/bin/conda init bash
2025-05-07T20:24:07.6566126Z no change /home/ec2-user/miniconda/condabin/conda
2025-05-07T20:24:07.6566517Z no change /home/ec2-user/miniconda/bin/conda
2025-05-07T20:24:07.6566867Z no change /home/ec2-user/miniconda/bin/conda-env
2025-05-07T20:24:07.6567218Z no change /home/ec2-user/miniconda/bin/activate
2025-05-07T20:24:07.6567567Z no change /home/ec2-user/miniconda/bin/deactivate
2025-05-07T20:24:07.6567941Z no change /home/ec2-user/miniconda/etc/profile.d/conda.sh
2025-05-07T20:24:07.6568362Z no change /home/ec2-user/miniconda/etc/fish/conf.d/conda.fish
2025-05-07T20:24:07.6568793Z no change /home/ec2-user/miniconda/shell/condabin/Conda.psm1
2025-05-07T20:24:07.6569228Z no change /home/ec2-user/miniconda/shell/condabin/conda-hook.ps1
2025-05-07T20:24:07.6570035Z no change /home/ec2-user/miniconda/lib/python3.13/site-packages/xontrib/conda.xsh
2025-05-07T20:24:07.6570547Z no change /home/ec2-user/miniconda/etc/profile.d/conda.csh
2025-05-07T20:24:07.6570895Z modified /home/ec2-user/.bashrc
2025-05-07T20:24:07.6571268Z ==> For changes to take effect, close and re-open your current shell. <==
2025-05-07T20:24:07.7220969Z + . /home/ec2-user/.bashrc
2025-05-07T20:24:08.5590019Z [SETUP] Installing libmamba-solver (required since Anaconda 2024.02-1) and libarchive ...
2025-05-07T20:24:08.5612794Z [EXEC] [ATTEMPT 0/3] + conda install --solver=classic -c conda-forge --override-channels -y conda-libmamba-solver libmamba libmambapy libarchive
2025-05-07T20:24:21.7783762Z Collecting package metadata (current_repodata.json): done
2025-05-07T20:24:23.3569837Z Solving environment: done
2025-05-07T20:24:23.4521892Z ## Package Plan ##
2025-05-07T20:24:23.4522172Z environment location: /home/ec2-user/miniconda
2025-05-07T20:24:23.4522523Z added / updated specs:
2025-05-07T20:24:23.4522780Z - conda-libmamba-solver
2025-05-07T20:24:23.4523028Z - libarchive
2025-05-07T20:24:23.4523228Z - libmamba
2025-05-07T20:24:23.4523428Z - libmambapy
2025-05-07T20:24:23.4523690Z The following packages will be downloaded:
2025-05-07T20:24:23.4524014Z package | build
2025-05-07T20:24:23.4524328Z ---------------------------|-----------------
2025-05-07T20:24:23.4524729Z ca-certificates-2025.4.26 | hbd8a1cb_0 149 KB conda-forge
2025-05-07T20:24:23.4525397Z certifi-2025.4.26 | pyhd8ed1ab_0 154 KB conda-forge
2025-05-07T20:24:23.4525808Z conda-25.3.1 | py313h78bf25f_1 1.1 MB conda-forge
2025-05-07T20:24:23.4526277Z conda-libmamba-solver-25.4.0| pyhd8ed1ab_0 41 KB conda-forge
2025-05-07T20:24:23.4526722Z ------------------------------------------------------------
2025-05-07T20:24:23.4527045Z Total: 1.4 MB
2025-05-07T20:24:23.4527372Z The following packages will be UPDATED:
2025-05-07T20:24:23.4531167Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:24:23.4531930Z conda pkgs/main::conda-25.3.1-py313h06a4308~ --> conda-forge::conda-25.3.1-py313h78bf25f_1
2025-05-07T20:24:23.4532514Z The following packages will be SUPERSEDED by a higher-priority channel:
2025-05-07T20:24:23.4533160Z certifi pkgs/main/linux-64::certifi-2025.4.26~ --> conda-forge/noarch::certifi-2025.4.26-pyhd8ed1ab_0
2025-05-07T20:24:23.4533934Z conda-libmamba-so~ pkgs/main::conda-libmamba-solver-25.4~ --> conda-forge::conda-libmamba-solver-25.4.0-pyhd8ed1ab_0
2025-05-07T20:24:23.4534564Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:24:23.7540761Z Preparing transaction: done
2025-05-07T20:24:23.8543397Z Verifying transaction: done
2025-05-07T20:24:25.1563299Z Executing transaction: done
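Note the --solver=classic flag in the command above: the classic solver is used once to bootstrap conda-libmamba-solver, after which conda can default to the faster libmamba solver. A minimal sketch of that pattern with standard conda commands (not a quote of setup_env.bash):

    # Sketch: bootstrap the libmamba solver, then make it the default.
    conda install -n base --solver=classic -c conda-forge --override-channels -y conda-libmamba-solver
    conda config --set solver libmamba
    conda config --show solver   # should now report: solver: libmamba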
2025-05-07T20:24:26.9087648Z [SETUP] Updating Miniconda base packages ...
2025-05-07T20:24:26.9110730Z [EXEC] [ATTEMPT 0/3] + conda update -n base -c defaults --update-deps -y conda
2025-05-07T20:24:27.8961551Z Channels:
2025-05-07T20:24:27.8961787Z - defaults
2025-05-07T20:24:27.8961991Z Platform: linux-64
2025-05-07T20:24:29.0906092Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:29.2089557Z Solving environment: done
2025-05-07T20:24:29.2089859Z Channels: - defaults
2025-05-07T20:24:29.2090063Z Platform: linux-64
2025-05-07T20:24:29.5172438Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:29.7277543Z Solving environment: done
2025-05-07T20:24:29.8795029Z ## Package Plan ##
2025-05-07T20:24:29.8795465Z environment location: /home/ec2-user/miniconda
2025-05-07T20:24:29.8796002Z added / updated specs:
2025-05-07T20:24:29.8796241Z - conda
2025-05-07T20:24:29.8796467Z The following packages will be downloaded:
2025-05-07T20:24:29.8796799Z package | build
2025-05-07T20:24:29.8797106Z ---------------------------|-----------------
2025-05-07T20:24:29.8797437Z pip-25.1 | pyhc872135_2 1.3 MB
2025-05-07T20:24:29.8797807Z tzdata-2025b | h04d1e81_0 116 KB
2025-05-07T20:24:29.8798548Z ------------------------------------------------------------
2025-05-07T20:24:29.8798952Z Total: 1.4 MB
2025-05-07T20:24:29.8799255Z The following packages will be UPDATED:
2025-05-07T20:24:29.8799810Z pip pkgs/main/linux-64::pip-25.0-py313h06~ --> pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:24:29.8800486Z tzdata 2025a-h04d1e81_0 --> 2025b-h04d1e81_0
2025-05-07T20:24:29.8801038Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:24:30.2678760Z Preparing transaction: done
2025-05-07T20:24:30.3684063Z Verifying transaction: done
2025-05-07T20:24:32.7714185Z Executing transaction: done
2025-05-07T20:24:33.3702550Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:24:33.3706266Z + conda clean --packages --tarball -y
2025-05-07T20:24:34.3811206Z Will remove 99 (117.8 MB) tarball(s).
2025-05-07T20:24:34.3811547Z Will remove 11 (16.0 MB) package(s).
2025-05-07T20:24:34.4441496Z + conda clean --all -y
2025-05-07T20:24:34.9819833Z There are no unused tarball(s) to remove.
2025-05-07T20:24:34.9820298Z Will remove 1 index cache(s).
2025-05-07T20:24:34.9820671Z There are no unused package(s) to remove.
2025-05-07T20:24:34.9821095Z There are no tempfile(s) to remove.
2025-05-07T20:24:34.9821503Z There are no logfile(s) to remove.
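The update plus the two clean passes keep the base environment current while reclaiming cache space on the runner disk. Reduced to a standalone sequence, the pattern is roughly:

    # Sketch: refresh base, then drop caches to keep runner disk usage small.
    conda update -n base -c defaults --update-deps -y conda
    conda clean --packages --tarball -y   # cached tarballs and unpacked packages
    conda clean --all -y                  # index caches, tempfiles, logfiles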
2025-05-07T20:24:35.0453293Z + conda info
2025-05-07T20:24:35.7911510Z active environment : base
2025-05-07T20:24:35.7911956Z active env location : /home/ec2-user/miniconda
2025-05-07T20:24:35.7912276Z shell level : 1
2025-05-07T20:24:35.7912573Z user config file : /home/ec2-user/.condarc
2025-05-07T20:24:35.7912948Z populated config files : /home/ec2-user/miniconda/.condarc
2025-05-07T20:24:35.7913290Z conda version : 25.3.1
2025-05-07T20:24:35.7913562Z conda-build version : not installed
2025-05-07T20:24:35.7913854Z python version : 3.13.2.final.0
2025-05-07T20:24:35.7914147Z solver : libmamba (default)
2025-05-07T20:24:35.7914444Z virtual packages : __archspec=1=zen2
2025-05-07T20:24:35.7914730Z __conda=25.3.1=0
2025-05-07T20:24:35.7914989Z __cuda=12.8=0
2025-05-07T20:24:35.7915252Z __glibc=2.34=0
2025-05-07T20:24:35.7915520Z __linux=6.1.130=0
2025-05-07T20:24:35.7915785Z __unix=0=0
2025-05-07T20:24:35.7916100Z base environment : /home/ec2-user/miniconda (writable)
2025-05-07T20:24:35.7916492Z conda av data dir : /home/ec2-user/miniconda/etc/conda
2025-05-07T20:24:35.7917162Z conda av metadata url : None
2025-05-07T20:24:35.7917521Z channel URLs : https://repo.anaconda.com/pkgs/main/linux-64
2025-05-07T20:24:35.7917941Z https://repo.anaconda.com/pkgs/main/noarch
2025-05-07T20:24:35.7918310Z https://repo.anaconda.com/pkgs/r/linux-64
2025-05-07T20:24:35.7918673Z https://repo.anaconda.com/pkgs/r/noarch
2025-05-07T20:24:35.7919029Z package cache : /home/ec2-user/miniconda/pkgs
2025-05-07T20:24:35.7919355Z /home/ec2-user/.conda/pkgs
2025-05-07T20:24:35.7919684Z envs directories : /home/ec2-user/miniconda/envs
2025-05-07T20:24:35.7919999Z /home/ec2-user/.conda/envs
2025-05-07T20:24:35.7920285Z platform : linux-64
2025-05-07T20:24:35.7921112Z user-agent : conda/25.3.1 requests/2.32.3 CPython/3.13.2 Linux/6.1.130-139.222.amzn2023.x86_64 amzn/2023.6.20250317 glibc/2.34 solver/libmamba conda-libmamba-solver/25.4.0 libmambapy/2.0.5 aau/0.7.0 c/. s/. e/.
2025-05-07T20:24:35.7921904Z UID:GID : 1000:1000
2025-05-07T20:24:35.7922160Z netrc file : None
2025-05-07T20:24:35.7922406Z offline mode : False
2025-05-07T20:24:35.8566105Z [SETUP] Exporting Miniconda variables ...
2025-05-07T20:24:35.8566872Z [SETUP] Saving Miniconda variables to /home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_ec600945-1e6b-443f-bc5e-7e18edd52288 ...
2025-05-07T20:24:35.8567646Z [SETUP] Successfully set up Miniconda at /home/ec2-user/miniconda
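Condensed, the setup_miniconda step amounts to a standard non-interactive Miniconda install. A minimal reproduction, using the installer URL and paths shown in this log:

    # Sketch: non-interactive Miniconda bootstrap, as performed above.
    wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
    bash miniconda.sh -b -p "$HOME/miniconda" -u   # -b batch mode, -p prefix, -u update in place
    rm -f miniconda.sh
    "$HOME/miniconda/bin/conda" init bash
    . "$HOME/.bashrc"   # pick up the conda shell hook in the current shell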
2025-05-07T20:24:35.8637099Z ##[group]Run . $PRELUDE; create_conda_environment $BUILD_ENV 3.13
2025-05-07T20:24:35.8637589Z . $PRELUDE; create_conda_environment $BUILD_ENV 3.13
2025-05-07T20:24:35.8654473Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:24:35.8664054Z env:
2025-05-07T20:24:35.8664294Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:24:35.8664594Z BUILD_ENV: build_binary
2025-05-07T20:24:35.8664836Z BUILD_TARGET: genai
2025-05-07T20:24:35.8665066Z BUILD_VARIANT: cuda
2025-05-07T20:24:35.8665293Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:24:35.8665547Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:24:35.8665846Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:24:35.8666387Z ##[endgroup]
2025-05-07T20:24:36.2027927Z ################################################################################
2025-05-07T20:24:36.2028287Z # Create Conda Environment
2025-05-07T20:24:36.2028535Z #
2025-05-07T20:24:36.2044125Z # [2025-05-07T20:24:36.204Z] + create_conda_environment build_binary 3.13
2025-05-07T20:24:36.2044539Z ################################################################################
2025-05-07T20:24:36.2059377Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:24:36.2958564Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:24:36.2958931Z [SETUP] Listing existing Conda environments ...
2025-05-07T20:24:36.2959245Z + conda info --envs
2025-05-07T20:24:37.0398959Z # conda environments:
2025-05-07T20:24:37.0399203Z #
2025-05-07T20:24:37.0399417Z base /home/ec2-user/miniconda
2025-05-07T20:24:37.1059133Z [SETUP] Deleting the prefix directory if it exists ...
2025-05-07T20:24:38.7264636Z + rm -rf /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:38.7288162Z [SETUP] Creating new Conda environment (Python 3.13) ...
2025-05-07T20:24:38.7311793Z [EXEC] [ATTEMPT 0/3] + conda create -y -n build_binary python=3.13
2025-05-07T20:24:39.4887287Z Channels:
2025-05-07T20:24:39.4887706Z - defaults
2025-05-07T20:24:39.4888111Z Platform: linux-64
2025-05-07T20:24:41.0400620Z Collecting package metadata (repodata.json): done
2025-05-07T20:24:41.1645656Z Solving environment: done
2025-05-07T20:24:41.1933354Z ## Package Plan ##
2025-05-07T20:24:41.1934086Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:24:41.1934915Z added / updated specs:
2025-05-07T20:24:41.1935396Z - python=3.13
2025-05-07T20:24:41.1935904Z The following packages will be downloaded:
2025-05-07T20:24:41.1936543Z package | build
2025-05-07T20:24:41.1937167Z ---------------------------|-----------------
2025-05-07T20:24:41.1937863Z _libgcc_mutex-0.1 | main 3 KB
2025-05-07T20:24:41.1938620Z _openmp_mutex-5.1 | 1_gnu 21 KB
2025-05-07T20:24:41.1939418Z ca-certificates-2025.2.25 | h06a4308_0 129 KB
2025-05-07T20:24:41.1940501Z python_abi-3.13 | 0_cp313 6 KB
2025-05-07T20:24:41.1941204Z ------------------------------------------------------------
2025-05-07T20:24:41.1941833Z Total: 159 KB
2025-05-07T20:24:41.1942295Z The following NEW packages will be INSTALLED:
2025-05-07T20:24:41.1942716Z _libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main
2025-05-07T20:24:41.1943154Z _openmp_mutex pkgs/main/linux-64::_openmp_mutex-5.1-1_gnu
2025-05-07T20:24:41.1943986Z bzip2 pkgs/main/linux-64::bzip2-1.0.8-h5eee18b_6
2025-05-07T20:24:41.1944470Z ca-certificates pkgs/main/linux-64::ca-certificates-2025.2.25-h06a4308_0
2025-05-07T20:24:41.1944940Z expat pkgs/main/linux-64::expat-2.7.1-h6a678d5_0
2025-05-07T20:24:41.1945374Z ld_impl_linux-64 pkgs/main/linux-64::ld_impl_linux-64-2.40-h12ee557_0
2025-05-07T20:24:41.1945835Z libffi pkgs/main/linux-64::libffi-3.4.4-h6a678d5_1
2025-05-07T20:24:41.1946248Z libgcc-ng pkgs/main/linux-64::libgcc-ng-11.2.0-h1234567_1
2025-05-07T20:24:41.1946670Z libgomp pkgs/main/linux-64::libgomp-11.2.0-h1234567_1
2025-05-07T20:24:41.1947083Z libmpdec pkgs/main/linux-64::libmpdec-4.0.0-h5eee18b_0
2025-05-07T20:24:41.1947775Z libstdcxx-ng pkgs/main/linux-64::libstdcxx-ng-11.2.0-h1234567_1
2025-05-07T20:24:41.1948216Z libuuid pkgs/main/linux-64::libuuid-1.41.5-h5eee18b_0
2025-05-07T20:24:41.1948628Z ncurses pkgs/main/linux-64::ncurses-6.4-h6a678d5_0
2025-05-07T20:24:41.1949029Z openssl pkgs/main/linux-64::openssl-3.0.16-h5eee18b_0
2025-05-07T20:24:41.1949419Z pip pkgs/main/noarch::pip-25.1-pyhc872135_2
2025-05-07T20:24:41.1949825Z python pkgs/main/linux-64::python-3.13.2-hf623796_100_cp313
2025-05-07T20:24:41.1950263Z python_abi pkgs/main/linux-64::python_abi-3.13-0_cp313
2025-05-07T20:24:41.1950673Z readline pkgs/main/linux-64::readline-8.2-h5eee18b_0
2025-05-07T20:24:41.1951139Z setuptools pkgs/main/linux-64::setuptools-78.1.1-py313h06a4308_0
2025-05-07T20:24:41.1951589Z sqlite pkgs/main/linux-64::sqlite-3.45.3-h5eee18b_0
2025-05-07T20:24:41.1951962Z tk pkgs/main/linux-64::tk-8.6.14-h39e8969_0
2025-05-07T20:24:41.1952332Z tzdata pkgs/main/noarch::tzdata-2025b-h04d1e81_0
2025-05-07T20:24:41.1952737Z wheel pkgs/main/linux-64::wheel-0.45.1-py313h06a4308_0
2025-05-07T20:24:41.1953123Z xz pkgs/main/linux-64::xz-5.6.4-h5eee18b_1
2025-05-07T20:24:41.1953478Z zlib pkgs/main/linux-64::zlib-1.2.13-h5eee18b_1
2025-05-07T20:24:41.1953861Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:24:41.4849979Z Preparing transaction: done
2025-05-07T20:24:42.9096416Z Verifying transaction: done
2025-05-07T20:24:45.2245094Z Executing transaction: done
2025-05-07T20:24:45.2744201Z #
2025-05-07T20:24:45.2744626Z # To activate this environment, use
2025-05-07T20:24:45.2745129Z #
2025-05-07T20:24:45.2745477Z # $ conda activate build_binary
2025-05-07T20:24:45.2745923Z #
2025-05-07T20:24:45.2746294Z # To deactivate an active environment, use
2025-05-07T20:24:45.2746809Z #
2025-05-07T20:24:45.2747127Z # $ conda deactivate
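Later steps never activate this environment in the shell; they address it with conda run -n build_binary instead, which is the more robust pattern in non-interactive CI shells. A small usage sketch:

    # Sketch: create the build environment, then run tools in it without activation.
    conda create -y -n build_binary python=3.13
    conda run -n build_binary python --version   # Python 3.13.2, as reported later in this log
    conda run -n build_binary pip --version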
2025-05-07T20:24:45.3785508Z [SETUP] Upgrading PIP to latest ...
2025-05-07T20:24:45.3809720Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --upgrade pip
2025-05-07T20:24:48.3444635Z Requirement already satisfied: pip in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (25.1)
2025-05-07T20:24:48.3445256Z Collecting pip
2025-05-07T20:24:48.3445564Z Downloading pip-25.1.1-py3-none-any.whl.metadata (3.6 kB)
2025-05-07T20:24:48.3445979Z Downloading pip-25.1.1-py3-none-any.whl (1.8 MB)
2025-05-07T20:24:48.3448738Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.8/1.8 MB 83.4 MB/s eta 0:00:00
2025-05-07T20:24:48.3449142Z Installing collected packages: pip
2025-05-07T20:24:48.3449431Z Attempting uninstall: pip
2025-05-07T20:24:48.3449709Z Found existing installation: pip 25.1
2025-05-07T20:24:48.3450023Z Uninstalling pip-25.1:
2025-05-07T20:24:48.3450297Z Successfully uninstalled pip-25.1
2025-05-07T20:24:48.3450603Z Successfully installed pip-25.1.1
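The [EXEC] [ATTEMPT 0/3] prefix throughout this log comes from a retry wrapper in the prelude script. The wrapper itself is not shown in the log; a hypothetical equivalent (the function name, backoff, and messages are illustrative only, not from setup_env.bash) could look like:

    # Hypothetical sketch of the retry pattern behind "[EXEC] [ATTEMPT n/3]".
    exec_with_retries () {
      local max=3 attempt
      for (( attempt = 0; attempt < max; attempt++ )); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max}] + $*"
        "$@" && return 0
        sleep $(( 2 ** attempt ))   # simple exponential backoff between attempts
      done
      echo "[EXEC] Command failed after ${max} attempts: $*" >&2
      return 1
    }
    # Usage: exec_with_retries conda run -n build_binary pip install --upgrade pip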
2025-05-07T20:24:48.4073836Z [SETUP] Upgrading pyOpenSSL ...
2025-05-07T20:24:48.4096606Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pyOpenSSL>22.1.0
2025-05-07T20:24:49.2629440Z Channels:
2025-05-07T20:24:49.2629685Z - conda-forge
2025-05-07T20:24:49.2629904Z Platform: linux-64
2025-05-07T20:24:59.4949187Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:01.1755166Z Solving environment: done
2025-05-07T20:25:01.2373613Z ## Package Plan ##
2025-05-07T20:25:01.2374013Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:01.2374413Z added / updated specs:
2025-05-07T20:25:01.2374688Z - pyopenssl[version='>22.1.0']
2025-05-07T20:25:01.2375007Z The following packages will be downloaded:
2025-05-07T20:25:01.2375338Z package | build
2025-05-07T20:25:01.2375653Z ---------------------------|-----------------
2025-05-07T20:25:01.2376008Z cffi-1.17.1 | py313hfab6e84_0 289 KB conda-forge
2025-05-07T20:25:01.2376449Z cryptography-44.0.3 | py313h6556f6e_0 1.5 MB conda-forge
2025-05-07T20:25:01.2376882Z libgcc-15.1.0 | h767d61c_2 810 KB conda-forge
2025-05-07T20:25:01.2377284Z libgcc-ng-15.1.0 | h69a702a_2 34 KB conda-forge
2025-05-07T20:25:01.2377695Z libgomp-15.1.0 | h767d61c_2 442 KB conda-forge
2025-05-07T20:25:01.2378093Z openssl-3.5.0 | h7b32b05_1 3.0 MB conda-forge
2025-05-07T20:25:01.2378832Z pycparser-2.22 | pyh29332c3_1 108 KB conda-forge
2025-05-07T20:25:01.2379273Z pyopenssl-25.0.0 | pyhd8ed1ab_0 120 KB conda-forge
2025-05-07T20:25:01.2379724Z typing-extensions-4.13.2 | h0e9735f_0 88 KB conda-forge
2025-05-07T20:25:01.2380193Z typing_extensions-4.13.2 | pyh29332c3_0 51 KB conda-forge
2025-05-07T20:25:01.2380599Z ------------------------------------------------------------
2025-05-07T20:25:01.2380933Z Total: 6.4 MB
2025-05-07T20:25:01.2381265Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:01.2381824Z cffi conda-forge/linux-64::cffi-1.17.1-py313hfab6e84_0
2025-05-07T20:25:01.2382312Z cryptography conda-forge/linux-64::cryptography-44.0.3-py313h6556f6e_0
2025-05-07T20:25:01.2382794Z libgcc conda-forge/linux-64::libgcc-15.1.0-h767d61c_2
2025-05-07T20:25:01.2385002Z pycparser conda-forge/noarch::pycparser-2.22-pyh29332c3_1
2025-05-07T20:25:01.2385476Z pyopenssl conda-forge/noarch::pyopenssl-25.0.0-pyhd8ed1ab_0
2025-05-07T20:25:01.2385986Z typing-extensions conda-forge/noarch::typing-extensions-4.13.2-h0e9735f_0
2025-05-07T20:25:01.2386552Z typing_extensions conda-forge/noarch::typing_extensions-4.13.2-pyh29332c3_0
2025-05-07T20:25:01.2386998Z The following packages will be UPDATED:
2025-05-07T20:25:01.2387698Z ca-certificates pkgs/main/linux-64::ca-certificates-2~ --> conda-forge/noarch::ca-certificates-2025.4.26-hbd8a1cb_0
2025-05-07T20:25:01.2388458Z libgcc-ng pkgs/main::libgcc-ng-11.2.0-h1234567_1 --> conda-forge::libgcc-ng-15.1.0-h69a702a_2
2025-05-07T20:25:01.2389107Z libgomp pkgs/main::libgomp-11.2.0-h1234567_1 --> conda-forge::libgomp-15.1.0-h767d61c_2
2025-05-07T20:25:01.2389725Z openssl pkgs/main::openssl-3.0.16-h5eee18b_0 --> conda-forge::openssl-3.5.0-h7b32b05_1
2025-05-07T20:25:01.2390270Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:25:01.7896570Z Preparing transaction: done
2025-05-07T20:25:01.8899364Z Verifying transaction: done
2025-05-07T20:25:03.3923534Z Executing transaction: done
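The import test that follows amounts to a one-line probe inside the environment; OpenSSL is the module name that the pyOpenSSL package installs, and printing the version is an optional extra rather than part of the original check:

    # Sketch: verify that pyOpenSSL is importable in the build environment.
    conda run -n build_binary python -c "import OpenSSL; print(OpenSSL.__version__)"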
2025-05-07T20:25:03.5665158Z [SETUP] Testing pyOpenSSL import ...
2025-05-07T20:25:05.2651250Z [CHECK] Python (sub-)package 'OpenSSL' found ...
2025-05-07T20:25:05.2664513Z [SETUP] Installing libxcrypt ...
2025-05-07T20:25:05.2687573Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
2025-05-07T20:25:06.1279464Z Channels:
2025-05-07T20:25:06.1279755Z - conda-forge
2025-05-07T20:25:06.1279983Z Platform: linux-64
2025-05-07T20:25:09.3829552Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:09.7480806Z Solving environment: done
2025-05-07T20:25:09.8085780Z ## Package Plan ##
2025-05-07T20:25:09.8086337Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:09.8086875Z added / updated specs:
2025-05-07T20:25:09.8087188Z - libxcrypt
2025-05-07T20:25:09.8087510Z The following packages will be downloaded:
2025-05-07T20:25:09.8087917Z package | build
2025-05-07T20:25:09.8088224Z ---------------------------|-----------------
2025-05-07T20:25:09.8088592Z libxcrypt-4.4.36 | hd590300_1 98 KB conda-forge
2025-05-07T20:25:09.8089381Z ------------------------------------------------------------
2025-05-07T20:25:09.8089711Z Total: 98 KB
2025-05-07T20:25:09.8090044Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:09.8090481Z libxcrypt conda-forge/linux-64::libxcrypt-4.4.36-hd590300_1
2025-05-07T20:25:09.8090923Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:25:10.1132001Z Preparing transaction: done
2025-05-07T20:25:10.2136390Z Verifying transaction: done
2025-05-07T20:25:10.3143064Z Executing transaction: done
2025-05-07T20:25:13.7219009Z [SETUP] Copying over ...
2025-05-07T20:25:13.7219766Z + cp /home/ec2-user/miniconda/envs/build_binary/include/crypt.h /home/ec2-user/miniconda/envs/build_binary/include/python3.13/crypt.h
2025-05-07T20:25:15.3584414Z [SETUP] Installed Python version: Python 3.13.2
2025-05-07T20:25:15.3584879Z [SETUP] Successfully created Conda environment: build_binary
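CPython 3.13 no longer ships the crypt module, and the step above works around that at the header level: it installs libxcrypt and grafts its crypt.h into the environment's Python include directory so builds that expect <crypt.h> next to the Python headers still compile. A minimal reproduction (the PREFIX variable is just shorthand for the env path shown in the log):

    # Sketch: provide crypt.h for builds against the Python 3.13 headers.
    PREFIX=/home/ec2-user/miniconda/envs/build_binary
    conda install -n build_binary -c conda-forge --override-channels -y libxcrypt
    cp "${PREFIX}/include/crypt.h" "${PREFIX}/include/python3.13/crypt.h"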
2025-05-07T20:25:15.3617145Z ##[group]Run . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:25:15.3617620Z . $PRELUDE; install_cxx_compiler $BUILD_ENV gcc
2025-05-07T20:25:15.3630110Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:25:15.3630453Z env:
2025-05-07T20:25:15.3630672Z PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:25:15.3630971Z BUILD_ENV: build_binary
2025-05-07T20:25:15.3631212Z BUILD_TARGET: genai
2025-05-07T20:25:15.3631438Z BUILD_VARIANT: cuda
2025-05-07T20:25:15.3631674Z BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:25:15.3641567Z ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:25:15.3641906Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:25:15.3642235Z ##[endgroup]
2025-05-07T20:25:15.7032457Z ################################################################################
2025-05-07T20:25:15.7032833Z # Install C/C++ Compilers
2025-05-07T20:25:15.7033070Z #
2025-05-07T20:25:15.7049712Z # [2025-05-07T20:25:15.704Z] + install_cxx_compiler build_binary gcc
2025-05-07T20:25:15.7050148Z ################################################################################
2025-05-07T20:25:15.7066532Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:25:15.7956116Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:25:15.7966737Z [INSTALL] Installing GLIBC (architecture = 64) ...
2025-05-07T20:25:15.7990097Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y sysroot_linux-64=2.17
2025-05-07T20:25:16.6639306Z Channels:
2025-05-07T20:25:16.6639552Z - conda-forge
2025-05-07T20:25:16.6639780Z Platform: linux-64
2025-05-07T20:25:19.9451069Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:20.3116376Z Solving environment: done
2025-05-07T20:25:20.3728608Z ## Package Plan ##
2025-05-07T20:25:20.3728997Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:20.3729414Z added / updated specs:
2025-05-07T20:25:20.3729683Z - sysroot_linux-64=2.17
2025-05-07T20:25:20.3729967Z The following packages will be downloaded:
2025-05-07T20:25:20.3730298Z package | build
2025-05-07T20:25:20.3730614Z ---------------------------|-----------------
2025-05-07T20:25:20.3731035Z kernel-headers_linux-64-3.10.0| he073ed8_18 921 KB conda-forge
2025-05-07T20:25:20.3731503Z sysroot_linux-64-2.17 | h0157908_18 14.5 MB conda-forge
2025-05-07T20:25:20.3731907Z ------------------------------------------------------------
2025-05-07T20:25:20.3732237Z Total: 15.4 MB
2025-05-07T20:25:20.3732567Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:20.3733066Z kernel-headers_li~ conda-forge/noarch::kernel-headers_linux-64-3.10.0-he073ed8_18
2025-05-07T20:25:20.3733926Z sysroot_linux-64 conda-forge/noarch::sysroot_linux-64-2.17-h0157908_18
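The pin in the plan above is the important part: sysroot_linux-64=2.17 makes the toolchain compile against glibc 2.17 headers, the compatibility baseline also used by manylinux2014-style wheels, so the resulting binaries stay loadable on older distributions. Reduced to one line:

    # Sketch: pin the conda sysroot to the glibc 2.17 compatibility baseline.
    conda install -n build_binary -c conda-forge --override-channels -y sysroot_linux-64=2.17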
2025-05-07T20:25:20.3734382Z Downloading and Extracting Packages: ...working... done
2025-05-07T20:25:21.4703827Z Preparing transaction: done
2025-05-07T20:25:21.6711668Z Verifying transaction: done
2025-05-07T20:25:21.8819831Z Executing transaction: done
2025-05-07T20:25:22.0337280Z [CHECK] LD_LIBRARY_PATH =
2025-05-07T20:25:22.0337633Z [CHECK] CONDA_PREFIX is not set.
2025-05-07T20:25:23.7083386Z [CHECK] libstdc++.so.6 found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libstdc++.so.6
2025-05-07T20:25:23.7101185Z [INSTALL] Installing GCC (11.4.0, 64) through Conda ...
2025-05-07T20:25:23.7124302Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y gxx_linux-64=11.4.0
2025-05-07T20:25:24.6057974Z Channels:
2025-05-07T20:25:24.6058304Z - conda-forge
2025-05-07T20:25:24.6058598Z Platform: linux-64
2025-05-07T20:25:27.8836719Z Collecting package metadata (repodata.json): done
2025-05-07T20:25:28.8284428Z Solving environment: done
2025-05-07T20:25:28.8911151Z ## Package Plan ##
2025-05-07T20:25:28.8911698Z environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:25:28.8912096Z added / updated specs:
2025-05-07T20:25:28.8912367Z - gxx_linux-64=11.4.0
2025-05-07T20:25:28.8912655Z The following packages will be downloaded:
2025-05-07T20:25:28.8912982Z package | build
2025-05-07T20:25:28.8913287Z ---------------------------|-----------------
2025-05-07T20:25:28.8913674Z binutils_impl_linux-64-2.40| ha1999f0_7 6.0 MB conda-forge
2025-05-07T20:25:28.8914146Z binutils_linux-64-2.40 | hb3c18ed_4 28 KB conda-forge
2025-05-07T20:25:28.8914590Z gcc_impl_linux-64-11.4.0 | h00c12a0_13 53.0 MB conda-forge
2025-05-07T20:25:28.8915019Z gcc_linux-64-11.4.0 | ha077dfb_4 31 KB conda-forge
2025-05-07T20:25:28.8915433Z gxx_impl_linux-64-11.4.0 | h634f3ee_13 11.2 MB conda-forge
2025-05-07T20:25:28.8915857Z gxx_linux-64-11.4.0 | h35bfe5d_4 29 KB conda-forge
2025-05-07T20:25:28.8916274Z ld_impl_linux-64-2.40 | hf3520f5_7 691 KB conda-forge
2025-05-07T20:25:28.8916723Z libgcc-devel_linux-64-11.4.0| h8f596e0_113 2.3 MB conda-forge
2025-05-07T20:25:28.8917507Z libsanitizer-11.4.0 | h5763a12_13 3.5 MB conda-forge
2025-05-07T20:25:28.8917929Z libstdcxx-15.1.0 | h8f9b012_2 3.7 MB conda-forge
2025-05-07T20:25:28.8918395Z libstdcxx-devel_linux-64-11.4.0| h8f596e0_113 11.1 MB conda-forge
2025-05-07T20:25:28.8918863Z libstdcxx-ng-15.1.0 | h4852527_2 34 KB conda-forge
2025-05-07T20:25:28.8919251Z ------------------------------------------------------------
2025-05-07T20:25:28.8919594Z Total: 91.6 MB
2025-05-07T20:25:28.8919921Z The following NEW packages will be INSTALLED:
2025-05-07T20:25:28.8920563Z binutils_impl_lin~ conda-forge/linux-64::binutils_impl_linux-64-2.40-ha1999f0_7
2025-05-07T20:25:28.8921125Z binutils_linux-64 conda-forge/linux-64::binutils_linux-64-2.40-hb3c18ed_4
2025-05-07T20:25:28.8921653Z gcc_impl_linux-64 conda-forge/linux-64::gcc_impl_linux-64-11.4.0-h00c12a0_13
2025-05-07T20:25:28.8922143Z gcc_linux-64 conda-forge/linux-64::gcc_linux-64-11.4.0-ha077dfb_4
2025-05-07T20:25:28.8922628Z gxx_impl_linux-64 conda-forge/linux-64::gxx_impl_linux-64-11.4.0-h634f3ee_13
2025-05-07T20:25:28.8923115Z gxx_linux-64 conda-forge/linux-64::gxx_linux-64-11.4.0-h35bfe5d_4
2025-05-07T20:25:28.8923623Z libgcc-devel_linu~ conda-forge/noarch::libgcc-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:25:28.8924179Z libsanitizer conda-forge/linux-64::libsanitizer-11.4.0-h5763a12_13
2025-05-07T20:25:28.8924659Z libstdcxx conda-forge/linux-64::libstdcxx-15.1.0-h8f9b012_2
2025-05-07T20:25:28.8925203Z libstdcxx-devel_l~ conda-forge/noarch::libstdcxx-devel_linux-64-11.4.0-h8f596e0_113
2025-05-07T20:25:28.8925669Z The following packages will be UPDATED:
2025-05-07T20:25:28.8926181Z ld_impl_linux-64 pkgs/main::ld_impl_linux-64-2.40-h12e~ --> conda-forge::ld_impl_linux-64-2.40-hf3520f5_7
2025-05-07T20:25:28.8926898Z libstdcxx-ng pkgs/main::libstdcxx-ng-11.2.0-h12345~ --> conda-forge::libstdcxx-ng-15.1.0-h4852527_2
2025-05-07T20:25:28.8927446Z Downloading and Extracting Packages: ...working...
2025-05-07T20:25:29.2396723Z libstdcxx-15.1.0 | 3.7 MB | ########## | 100%
2025-05-07T20:25:29.3414873Z binutils_impl_linux- | 6.0 MB | ########## | 100%
2025-05-07T20:25:29.5238317Z libsanitizer-11.4.0 | 3.5 MB | ########## | 100%
2025-05-07T20:25:29.5306437Z libgcc-devel_linux-6 | 2.3 MB | ########## | 100%
2025-05-07T20:25:29.5927858Z libstdcxx-ng-15.1.0 | 34 KB | ########## | 100%
2025-05-07T20:25:29.6092260Z ld_impl_linux-64-2.4 | 691 KB | ########## | 100%
2025-05-07T20:25:29.6283416Z gcc_linux-64-11.4.0 | 31 KB | ########## | 100%
2025-05-07T20:25:29.6623708Z gxx_linux-64-11.4.0 | 29 KB | ########## | 100%
2025-05-07T20:25:29.6690363Z binutils_linux-64-2. | 28 KB | ########## | 100%
2025-05-07T20:25:29.7193114Z libstdcxx-devel_linu | 11.1 MB | ########## | 100%
2025-05-07T20:25:29.7286234Z gxx_impl_linux-64-11 | 11.2 MB | ########## | 100%
2025-05-07T20:25:30.0340853Z gcc_impl_linux-64-11 | 53.0 MB | #######9 | 79%
2025-05-07T20:25:30.1144790Z 2025-05-07T20:25:30.1144794Z 2025-05-07T20:25:30.1144798Z 2025-05-07T20:25:30.1144801Z 2025-05-07T20:25:30.1144805Z 2025-05-07T20:25:30.1144808Z 2025-05-07T20:25:30.1144812Z 2025-05-07T20:25:30.1144816Z 2025-05-07T20:25:30.1173568Z gxx_linux-64-11.4.0 | 29 KB | ########## | 100%  2025-05-07T20:25:30.1173954Z 2025-05-07T20:25:30.1173958Z 2025-05-07T20:25:30.1173962Z 2025-05-07T20:25:30.1173966Z 2025-05-07T20:25:30.1173969Z 2025-05-07T20:25:30.1173973Z 2025-05-07T20:25:30.1173977Z 2025-05-07T20:25:30.1173980Z 2025-05-07T20:25:30.1173984Z 2025-05-07T20:25:30.1173988Z 2025-05-07T20:25:30.1173991Z 2025-05-07T20:25:30.1179090Z binutils_linux-64-2. | 28 KB | ########## | 100%  2025-05-07T20:25:30.1179509Z 2025-05-07T20:25:30.1179515Z 2025-05-07T20:25:30.1179521Z 2025-05-07T20:25:30.1179526Z 2025-05-07T20:25:30.1179532Z 2025-05-07T20:25:30.1179537Z 2025-05-07T20:25:30.1179543Z 2025-05-07T20:25:30.1179561Z 2025-05-07T20:25:30.1179566Z 2025-05-07T20:25:30.1179571Z 2025-05-07T20:25:30.1179577Z 2025-05-07T20:25:30.1329251Z binutils_linux-64-2. | 28 KB | ########## | 100%  2025-05-07T20:25:30.2922417Z gcc_impl_linux-64-11 | 53.0 MB | ########9 | 90% 2025-05-07T20:25:30.2922774Z 2025-05-07T20:25:30.2922780Z 2025-05-07T20:25:30.2924030Z 2025-05-07T20:25:30.4077518Z binutils_impl_linux- | 6.0 MB | ########## | 100%  2025-05-07T20:25:30.4595490Z gcc_impl_linux-64-11 | 53.0 MB | ########## | 100% 2025-05-07T20:25:30.4595758Z 2025-05-07T20:25:30.7035317Z gxx_impl_linux-64-11 | 11.2 MB | ########## | 100%  2025-05-07T20:25:30.7035594Z 2025-05-07T20:25:30.7035599Z 2025-05-07T20:25:31.1097162Z libstdcxx-devel_linu | 11.1 MB | ########## | 100%  2025-05-07T20:25:31.1104047Z gcc_impl_linux-64-11 | 53.0 MB | ########## | 100% 2025-05-07T20:25:31.1104508Z 2025-05-07T20:25:31.1104765Z 2025-05-07T20:25:31.1104975Z  2025-05-07T20:25:31.1105197Z 2025-05-07T20:25:31.1105202Z 2025-05-07T20:25:31.1105365Z  2025-05-07T20:25:31.1105578Z 2025-05-07T20:25:31.1105583Z 2025-05-07T20:25:31.1105587Z 2025-05-07T20:25:31.1105751Z  2025-05-07T20:25:31.1105956Z 2025-05-07T20:25:31.1105960Z 2025-05-07T20:25:31.1105965Z 2025-05-07T20:25:31.1105982Z 2025-05-07T20:25:31.1106221Z  2025-05-07T20:25:31.1106528Z 2025-05-07T20:25:31.1106534Z 2025-05-07T20:25:31.1106551Z 2025-05-07T20:25:31.1106557Z 2025-05-07T20:25:31.1106562Z 2025-05-07T20:25:31.1106833Z  2025-05-07T20:25:31.1107247Z 2025-05-07T20:25:31.1107251Z 2025-05-07T20:25:31.1107254Z 2025-05-07T20:25:31.1107258Z 2025-05-07T20:25:31.1107269Z 2025-05-07T20:25:31.1107273Z 2025-05-07T20:25:31.1107455Z  2025-05-07T20:25:31.1107748Z 2025-05-07T20:25:31.1107752Z 2025-05-07T20:25:31.1107755Z 2025-05-07T20:25:31.1107759Z 2025-05-07T20:25:31.1107763Z 2025-05-07T20:25:31.1107774Z 2025-05-07T20:25:31.1107777Z 2025-05-07T20:25:31.1107952Z  2025-05-07T20:25:31.1108159Z 2025-05-07T20:25:31.1108163Z 2025-05-07T20:25:31.1108167Z 2025-05-07T20:25:31.1108170Z 2025-05-07T20:25:31.1108174Z 2025-05-07T20:25:31.1108184Z 2025-05-07T20:25:31.1108188Z 2025-05-07T20:25:31.1108192Z 2025-05-07T20:25:31.1108498Z  2025-05-07T20:25:31.1108718Z 2025-05-07T20:25:31.1108722Z 2025-05-07T20:25:31.1108726Z 2025-05-07T20:25:31.1108735Z 2025-05-07T20:25:31.1108739Z 2025-05-07T20:25:31.1108743Z 2025-05-07T20:25:31.1108746Z 2025-05-07T20:25:31.1108750Z 2025-05-07T20:25:31.1108754Z 2025-05-07T20:25:31.1108934Z  2025-05-07T20:25:31.1109142Z 2025-05-07T20:25:31.1109151Z 2025-05-07T20:25:31.1109155Z 2025-05-07T20:25:31.1109159Z 2025-05-07T20:25:31.1109162Z 2025-05-07T20:25:31.1109166Z 
2025-05-07T20:25:31.1109170Z 2025-05-07T20:25:31.1109173Z 2025-05-07T20:25:31.1109177Z 2025-05-07T20:25:31.1109181Z 2025-05-07T20:25:31.1109364Z  2025-05-07T20:25:31.1109589Z 2025-05-07T20:25:31.1109593Z 2025-05-07T20:25:31.1109596Z 2025-05-07T20:25:31.1109600Z 2025-05-07T20:25:31.1109610Z 2025-05-07T20:25:31.1109614Z 2025-05-07T20:25:31.1109617Z 2025-05-07T20:25:31.1109621Z 2025-05-07T20:25:31.1109629Z 2025-05-07T20:25:31.1109633Z 2025-05-07T20:25:31.1109637Z 2025-05-07T20:25:31.1109843Z  done 2025-05-07T20:25:31.2114737Z Preparing transaction: \ done 2025-05-07T20:25:31.5121403Z Verifying transaction: / - \ done 2025-05-07T20:25:31.6131283Z Executing transaction: / done 2025-05-07T20:25:31.7772925Z [INSTALL] Setting the C/C++ compiler symlinks ... 2025-05-07T20:25:35.6459597Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/cc 2025-05-07T20:25:35.6460154Z 2025-05-07T20:25:35.6475198Z 2025-05-07T20:25:35.6493357Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-cc /home/ec2-user/miniconda/envs/build_binary/bin/gcc 2025-05-07T20:25:35.6493893Z 2025-05-07T20:25:35.6507001Z 2025-05-07T20:25:35.6525614Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/c++ 2025-05-07T20:25:35.6526147Z 2025-05-07T20:25:35.6538853Z 2025-05-07T20:25:35.6556804Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/bin/x86_64-conda-linux-gnu-c++ /home/ec2-user/miniconda/envs/build_binary/bin/g++ 2025-05-07T20:25:35.6557318Z 2025-05-07T20:25:35.6571284Z 2025-05-07T20:25:37.5374379Z /home/ec2-user/miniconda/envs/build_binary/bin/cc 2025-05-07T20:25:37.5374784Z 2025-05-07T20:25:37.5998249Z [CHECK] Binary cc found in PATH 2025-05-07T20:25:39.4792419Z /home/ec2-user/miniconda/envs/build_binary/bin/gcc 2025-05-07T20:25:39.4792748Z 2025-05-07T20:25:39.5414587Z [CHECK] Binary gcc found in PATH 2025-05-07T20:25:41.4122328Z /home/ec2-user/miniconda/envs/build_binary/bin/c++ 2025-05-07T20:25:41.4122631Z 2025-05-07T20:25:41.4733749Z [CHECK] Binary c++ found in PATH 2025-05-07T20:25:43.3453152Z /home/ec2-user/miniconda/envs/build_binary/bin/g++ 2025-05-07T20:25:43.3453425Z 2025-05-07T20:25:43.4086091Z [CHECK] Binary g++ found in PATH 2025-05-07T20:25:43.4089952Z [INFO] Printing out all preprocessor defines in the C compiler ... 
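[NOTE] A hedged aside on the command that follows: with a GCC-compatible driver, `-E` stops after preprocessing, `-dM` dumps every macro definition instead of the preprocessed source, and `-` reads the translation unit from stdin, so an empty stdin yields exactly the compiler's predefined macros. A minimal local reproduction sketch, assuming only that `cc` resolves to the conda-provided GCC set up above:
  echo | cc -dM -E -                    # dump all predefined C macros
  echo | cc -dM -E - | grep __VERSION   # e.g. isolate the compiler version macro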
2025-05-07T20:25:43.4090435Z + conda run -n build_binary cc -dM -E - 2025-05-07T20:25:43.4090640Z 2025-05-07T20:25:45.2937354Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:45.2937871Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:45.2938271Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:45.2938563Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:45.2938901Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:45.2939385Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:45.2939775Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:45.2940395Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:45.2940769Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:45.2941159Z #define __CHAR_BIT__ 8 2025-05-07T20:25:45.2941471Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:45.2942150Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:45.2942521Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:45.2942907Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:45.2943272Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:45.2943679Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:45.2943975Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:45.2944344Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:45.2944775Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:45.2945195Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:45.2945593Z #define __DBL_DENORM_MIN__ ((double)4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:45.2945997Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:45.2946296Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:45.2946559Z #define __GCC_IEC_559 2 2025-05-07T20:25:45.2946794Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:45.2947061Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:45.2947318Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:45.2947594Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:45.2948001Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:45.2948308Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:45.2948570Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:45.2948836Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:45.2949088Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:45.2949348Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:45.2949599Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:45.2949846Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:45.2950097Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:45.2950337Z #define __INT8_C(c) c 2025-05-07T20:25:45.2950569Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:45.2950850Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:45.2951156Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:45.2951460Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:45.2951802Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:45.2952070Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:45.2952334Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:45.2952597Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:45.2952866Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:45.2953250Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:45.2953655Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:45.2953928Z #define __linux 1 2025-05-07T20:25:45.2954150Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:45.2954422Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 
2025-05-07T20:25:45.2954687Z #define __unix 1 2025-05-07T20:25:45.2954906Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:45.2955175Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:45.2955435Z #define __WINT_MIN__ 0U 2025-05-07T20:25:45.2955672Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:45.2955949Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:45.2956207Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:45.2956466Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:45.2956893Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:45.2957162Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:45.2957445Z #define __INT64_C(c) c ## L 2025-05-07T20:25:45.2957701Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:45.2957982Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:45.2958239Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:45.2958581Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:45.2958954Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:45.2959191Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:45.2959443Z #define __DBL_DIG__ 15 2025-05-07T20:25:45.2959663Z #define __FLT32_DIG__ 6 2025-05-07T20:25:45.2959947Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:45.2960284Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:45.2960613Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:45.2960924Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:45.2961259Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:45.2961496Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:45.2961741Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:45.2962109Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:45.2962489Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:45.2962766Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:45.2963013Z #define __unix__ 1 2025-05-07T20:25:45.2963220Z #define __INT_WIDTH__ 32 2025-05-07T20:25:45.2963455Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:45.2963690Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:45.2963925Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:25:45.2964179Z #define __UINT16_C(c) c 2025-05-07T20:25:45.2964407Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:45.2964646Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:45.2964994Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:45.2965340Z #define __gnu_linux__ 1 2025-05-07T20:25:45.2965574Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:25:45.2965843Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:45.2966122Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:45.2966374Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:45.2966624Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:45.2976049Z #define __GNUC__ 11 2025-05-07T20:25:45.2976275Z #define __pie__ 2 2025-05-07T20:25:45.2976482Z #define __MMX__ 1 2025-05-07T20:25:45.2976702Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:45.2976963Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:45.2977236Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:45.2977505Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:45.2977843Z #define __DBL_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:45.2978235Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:45.2978553Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:45.2978805Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:45.2979064Z #define 
__HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:45.2979350Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:45.2979609Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:45.2979860Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:45.2980126Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:45.2980410Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:45.2980671Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:45.2980933Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:45.2981177Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:45.2981436Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:25:45.2981689Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:45.2981948Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:45.2982197Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:45.2982504Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:45.2982852Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:45.2983116Z #define __SSE2_MATH__ 1 2025-05-07T20:25:45.2983359Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:45.2983763Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:45.2984048Z #define __amd64 1 2025-05-07T20:25:45.2984267Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:45.2984519Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:45.2984813Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:45.2985115Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:45.2985358Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:25:45.2985624Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:25:45.2985870Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:25:45.2986117Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:25:45.2986367Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:25:45.2986608Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:25:45.2986860Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:25:45.2987118Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:45.2987453Z #define __x86_64 1 2025-05-07T20:25:45.2987768Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:25:45.2988142Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:25:45.2988592Z #define __DBL_MIN__ ((double)2.22507385850720138309023271733240406e-308L) 2025-05-07T20:25:45.2989039Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:25:45.2989498Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:45.2989869Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:25:45.2990112Z #define __LP64__ 1 2025-05-07T20:25:45.2990326Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:45.2990663Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:25:45.2991027Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:25:45.2991295Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:45.2991552Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:25:45.2991822Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:25:45.2992089Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:25:45.2992339Z #define __REGISTER_PREFIX__ 2025-05-07T20:25:45.2992591Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:25:45.2992839Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:25:45.2993081Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:25:45.2993399Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:25:45.2993748Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:25:45.2994009Z #define __FLT_DIG__ 6 2025-05-07T20:25:45.2994231Z #define __NO_INLINE__ 1 2025-05-07T20:25:45.2994464Z #define 
__DEC_EVAL_METHOD__ 2 2025-05-07T20:25:45.2994779Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:25:45.2995111Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:25:45.2995435Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:25:45.2995768Z #define __VERSION__ "11.4.0" 2025-05-07T20:25:45.2996009Z #define __UINT64_C(c) c ## UL 2025-05-07T20:25:45.2996258Z #define _STDC_PREDEF_H 1 2025-05-07T20:25:45.2996515Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:25:45.2996798Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:25:45.2997076Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:25:45.2997332Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:25:45.2997616Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:45.2997936Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:25:45.2998187Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:25:45.2998435Z #define __FLT128_DIG__ 33 2025-05-07T20:25:45.2998665Z #define __INT32_C(c) c 2025-05-07T20:25:45.2998897Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:25:45.2999161Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:25:45.2999420Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:25:45.2999690Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:25:45.2999991Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:25:45.3000277Z #define unix 1 2025-05-07T20:25:45.3000495Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:25:45.3000799Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:45.3001087Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:25:45.3001486Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:25:45.3001797Z #define __FLT64X_DIG__ 18 2025-05-07T20:25:45.3002029Z #define __INT8_TYPE__ signed char 2025-05-07T20:25:45.3002277Z #define __ELF__ 1 2025-05-07T20:25:45.3002497Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:25:45.3002760Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:25:45.3003022Z #define __FLT_RADIX__ 2 2025-05-07T20:25:45.3003258Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:25:45.3003604Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:25:45.3003947Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:25:45.3004190Z #define __SSE_MATH__ 1 2025-05-07T20:25:45.3004403Z #define __k8 1 2025-05-07T20:25:45.3004683Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:25:45.3005040Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:25:45.3005417Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:25:45.3005705Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:25:45.3005958Z #define __LDBL_DIG__ 18 2025-05-07T20:25:45.3006249Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:25:45.3006581Z #define __x86_64__ 1 2025-05-07T20:25:45.3006852Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:25:45.3007173Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:25:45.3007507Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:45.3007800Z #define __FLT64_DIG__ 15 2025-05-07T20:25:45.3008070Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:45.3008403Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:25:45.3008699Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:25:45.3008956Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:25:45.3009222Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:45.3009504Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:25:45.3009864Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 
2025-05-07T20:25:45.3010250Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:25:45.3010527Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:25:45.3010849Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:25:45.3011154Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:25:45.3011432Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:25:45.3011702Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:25:45.3011991Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:25:45.3012263Z #define __SIZE_WIDTH__ 64 2025-05-07T20:25:45.3012487Z #define __SEG_FS 1 2025-05-07T20:25:45.3012710Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:25:45.3012975Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:25:45.3013231Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:45.3013506Z #define __SEG_GS 1 2025-05-07T20:25:45.3013812Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:25:45.3014173Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:25:45.3014435Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:25:45.3014709Z #define __INT16_TYPE__ short int 2025-05-07T20:25:45.3014972Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:25:45.3015252Z #define __STDC_VERSION__ 201710L 2025-05-07T20:25:45.3015508Z #define __SIZEOF_INT__ 4 2025-05-07T20:25:45.3015738Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:25:45.3015984Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:25:45.3016311Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:45.3016695Z #define __INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:45.3016960Z #define linux 1 2025-05-07T20:25:45.3017172Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:25:45.3017435Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:25:45.3017689Z #define __FLT32X_DIG__ 15 2025-05-07T20:25:45.3017929Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:25:45.3018172Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:25:45.3018413Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:25:45.3018754Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:45.3019290Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:25:45.3019602Z #define __code_model_small__ 1 2025-05-07T20:25:45.3019865Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:25:45.3020137Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:25:45.3020372Z #define __k8__ 1 2025-05-07T20:25:45.3020586Z #define __INTPTR_TYPE__ long int 2025-05-07T20:25:45.3020859Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:25:45.3021140Z #define __WCHAR_TYPE__ int 2025-05-07T20:25:45.3021365Z #define __pic__ 2 2025-05-07T20:25:45.3021604Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:45.3021900Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:25:45.3022175Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:45.3022488Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:25:45.3022841Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:45.3023266Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:45.3023531Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:25:45.3023815Z #define __UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:25:45.3024106Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:25:45.3024335Z #define __linux__ 1 2025-05-07T20:25:45.3024544Z #define __INT64_TYPE__ long int 2025-05-07T20:25:45.3024791Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:25:45.3025030Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:25:45.3025282Z 
#define __DBL_MANT_DIG__ 53 2025-05-07T20:25:45.3025521Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:25:45.3025792Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:45.3026101Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:45.3026376Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:25:45.3026621Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:25:45.3026896Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:25:45.3027172Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:25:45.3027487Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:45.3027941Z #define __SSE__ 1 2025-05-07T20:25:45.3028166Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:25:45.3028495Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:45.3028819Z #define __amd64__ 1 2025-05-07T20:25:45.3029033Z #define __WINT_WIDTH__ 32 2025-05-07T20:25:45.3029276Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:45.3029527Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:25:45.3029786Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:25:45.3030037Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:25:45.3030296Z #define __SIZEOF_INT128__ 16 2025-05-07T20:25:45.3030544Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:25:45.3030805Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:25:45.3031053Z #define __ATOMIC_RELAXED 0 2025-05-07T20:25:45.3031384Z #define __DBL_EPSILON__ ((double)2.22044604925031308084726333618164062e-16L) 2025-05-07T20:25:45.3031839Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:25:45.3032191Z #define _LP64 1 2025-05-07T20:25:45.3032395Z #define __UINT8_C(c) c 2025-05-07T20:25:45.3032623Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:25:45.3032887Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:25:45.3033138Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:25:45.3033394Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:25:45.3033679Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:25:45.3034015Z #define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:25:45.3034461Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:25:45.3034819Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:25:45.3035094Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:45.3035393Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:25:45.3035751Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:25:45.3036107Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:25:45.3036353Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:25:45.3036678Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:25:45.3037156Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:25:45.3037424Z #define __STDC_UTF_32__ 1 2025-05-07T20:25:45.3037665Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:25:45.3037912Z #define __FXSR__ 1 2025-05-07T20:25:45.3038200Z #define __FLT32X_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:45.3038644Z #define __DBL_NORM_MAX__ ((double)1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:45.3039040Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:45.3039335Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:25:45.3039574Z #define __UINT32_C(c) c ## U 2025-05-07T20:25:45.3039895Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:25:45.3040506Z #define __INT8_MAX__ 0x7f 2025-05-07T20:25:45.3040737Z #define __LONG_WIDTH__ 
64 2025-05-07T20:25:45.3041133Z #define __PIC__ 2 2025-05-07T20:25:45.3041411Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:25:45.3041860Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:25:45.3042296Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:25:45.3042663Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:25:45.3043026Z #define __SSE2__ 1 2025-05-07T20:25:45.3043260Z #define __INT32_TYPE__ int 2025-05-07T20:25:45.3043529Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:25:45.3043796Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:25:45.3044121Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:25:45.3044484Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:25:45.3044745Z #define __INTMAX_TYPE__ long int 2025-05-07T20:25:45.3045002Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:25:45.3045261Z #define __FLT32X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:45.3045527Z #define __ATOMIC_CONSUME 1 2025-05-07T20:25:45.3045763Z #define __GNUC_MINOR__ 4 2025-05-07T20:25:45.3046001Z #define __INT_FAST16_WIDTH__ 64 2025-05-07T20:25:45.3046276Z #define __UINTMAX_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:45.3046560Z #define __PIE__ 2 2025-05-07T20:25:45.3046875Z #define __FLT32X_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F32x 2025-05-07T20:25:45.3047266Z #define __DBL_MAX_10_EXP__ 308 2025-05-07T20:25:45.3047595Z #define __LDBL_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951L 2025-05-07T20:25:45.3047938Z #define __INT16_C(c) c 2025-05-07T20:25:45.3048177Z #define __STDC__ 1 2025-05-07T20:25:45.3048405Z #define __PTRDIFF_TYPE__ long int 2025-05-07T20:25:45.3048665Z #define __ATOMIC_SEQ_CST 5 2025-05-07T20:25:45.3048904Z #define __FLT32X_MIN_10_EXP__ (-307) 2025-05-07T20:25:45.3049186Z #define __UINTPTR_TYPE__ long unsigned int 2025-05-07T20:25:45.3049523Z #define __DEC64_SUBNORMAL_MIN__ 0.000000000000001E-383DD 2025-05-07T20:25:45.3049833Z #define __DEC128_MANT_DIG__ 34 2025-05-07T20:25:45.3050092Z #define __LDBL_MIN_10_EXP__ (-4931) 2025-05-07T20:25:45.3050360Z #define __SIZEOF_LONG_LONG__ 8 2025-05-07T20:25:45.3050609Z #define __FLT128_DECIMAL_DIG__ 36 2025-05-07T20:25:45.3050890Z #define __GCC_ATOMIC_LLONG_LOCK_FREE 2 2025-05-07T20:25:45.3051172Z #define __FLT32_HAS_QUIET_NAN__ 1 2025-05-07T20:25:45.3051431Z #define __FLT_DECIMAL_DIG__ 9 2025-05-07T20:25:45.3051708Z #define __UINT_FAST16_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:45.3052087Z #define __LDBL_NORM_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:45.3052445Z #define __GCC_ATOMIC_SHORT_LOCK_FREE 2 2025-05-07T20:25:45.3052729Z #define __UINT_FAST8_TYPE__ unsigned char 2025-05-07T20:25:45.3053007Z #define __ATOMIC_ACQ_REL 4 2025-05-07T20:25:45.3053249Z #define __ATOMIC_RELEASE 3 2025-05-07T20:25:45.3053398Z 2025-05-07T20:25:45.3554809Z 2025-05-07T20:25:45.3555114Z [INFO] Printing out all preprocessor defines in the C++ compiler ... 
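[NOTE] A hedged aside on the C++ invocation that follows: stdin (`-`) carries no file extension, so `-x c++` is needed to force the C++ front end. The resulting dump extends the C list with dialect and feature-test macros such as `__cplusplus 201703L` and `__cpp_if_constexpr 201606L`, both visible below. A minimal sketch for inspecting them locally, assuming the same conda toolchain:
  echo | c++ -dM -E -x c++ - | grep -E '__cplusplus|__cpp_'   # list C++ dialect and feature-test macros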
2025-05-07T20:25:45.3555591Z + conda run -n build_binary c++ -dM -E -x c++ - 2025-05-07T20:25:45.3555885Z 2025-05-07T20:25:47.2426374Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:25:47.2426837Z #define __cpp_attributes 200809L 2025-05-07T20:25:47.2427527Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:25:47.2428027Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:25:47.2428309Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:25:47.2428564Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:25:47.2428908Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:25:47.2429261Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:25:47.2429534Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:25:47.2429838Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:25:47.2430131Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:25:47.2430395Z #define __INTMAX_C(c) c ## L 2025-05-07T20:25:47.2430644Z #define __CHAR_BIT__ 8 2025-05-07T20:25:47.2430868Z #define __UINT8_MAX__ 0xff 2025-05-07T20:25:47.2431116Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:25:47.2431365Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:25:47.2431803Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:25:47.2432074Z #define __cpp_static_assert 201411L 2025-05-07T20:25:47.2432362Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:25:47.2432656Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:47.2432942Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:25:47.2433222Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:25:47.2433538Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:25:47.2433842Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:25:47.2434234Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:25:47.2434642Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:25:47.2434941Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:25:47.2435215Z #define __GCC_IEC_559 2 2025-05-07T20:25:47.2435452Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:25:47.2435719Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:25:47.2435993Z #define __cpp_binary_literals 201304L 2025-05-07T20:25:47.2436273Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:25:47.2436557Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:25:47.2436862Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:25:47.2437164Z #define __cpp_variadic_templates 200704L 2025-05-07T20:25:47.2437485Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:47.2437791Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:25:47.2438057Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:25:47.2438329Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:25:47.2438594Z #define __cpp_variable_templates 201304L 2025-05-07T20:25:47.2438895Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:25:47.2439154Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:25:47.2439403Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:25:47.2439673Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:25:47.2439997Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:25:47.2440650Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:25:47.2440905Z #define __INT8_C(c) c 2025-05-07T20:25:47.2441136Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:25:47.2441411Z #define __cpp_variadic_using 201611L 2025-05-07T20:25:47.2441719Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:47.2442035Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:25:47.2442304Z #define __cpp_capture_star_this 201603L 
2025-05-07T20:25:47.2442581Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:25:47.2442899Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:25:47.2443242Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:25:47.2443513Z #define __cpp_if_constexpr 201606L 2025-05-07T20:25:47.2443781Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:25:47.2444042Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:25:47.2444305Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:25:47.2444584Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:25:47.2444967Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:25:47.2445377Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:25:47.2445652Z #define __linux 1 2025-05-07T20:25:47.2446021Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:25:47.2446295Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:25:47.2446557Z #define __unix 1 2025-05-07T20:25:47.2446777Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:25:47.2447051Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:25:47.2447327Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:25:47.2447618Z #define __WINT_MIN__ 0U 2025-05-07T20:25:47.2447876Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:25:47.2448161Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:25:47.2448429Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:25:47.2448687Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:25:47.2448928Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:25:47.2449210Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:25:47.2449496Z #define __INT64_C(c) c ## L 2025-05-07T20:25:47.2449874Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:25:47.2450166Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:25:47.2450430Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:25:47.2450724Z #define __cpp_aligned_new 201606L 2025-05-07T20:25:47.2451000Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:25:47.2451256Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:25:47.2451594Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:25:47.2451961Z #define __STDC_HOSTED__ 1 2025-05-07T20:25:47.2452209Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:25:47.2452478Z #define __cpp_decltype_auto 201304L 2025-05-07T20:25:47.2452741Z #define __DBL_DIG__ 15 2025-05-07T20:25:47.2452965Z #define __FLT32_DIG__ 6 2025-05-07T20:25:47.2453259Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:25:47.2453589Z #define __GXX_WEAK__ 1 2025-05-07T20:25:47.2453818Z #define __SHRT_WIDTH__ 16 2025-05-07T20:25:47.2454060Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:25:47.2463190Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:25:47.2463579Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:25:47.2463859Z #define __DEC32_MAX__ 9.999999E96DF 2025-05-07T20:25:47.2464168Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:25:47.2464497Z #define __cpp_enumerator_attributes 201411L 2025-05-07T20:25:47.2464901Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:25:47.2465300Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:25:47.2465580Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:25:47.2465833Z #define __unix__ 1 2025-05-07T20:25:47.2466059Z #define __INT_WIDTH__ 32 2025-05-07T20:25:47.2466299Z #define __SIZEOF_LONG__ 8 2025-05-07T20:25:47.2466544Z #define __STDC_IEC_559__ 1 2025-05-07T20:25:47.2466786Z #define __STDC_ISO_10646__ 201103L 
2025-05-07T20:25:47.2467047Z #define __UINT16_C(c) c 2025-05-07T20:25:47.2467285Z #define __DECIMAL_DIG__ 21 2025-05-07T20:25:47.2467531Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:25:47.2468004Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:25:47.2468370Z #define __gnu_linux__ 1 2025-05-07T20:25:47.2468609Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:25:47.2468863Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:25:47.2469142Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:25:47.2469423Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:47.2469681Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:25:47.2469937Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:25:47.2470182Z #define __GNUC__ 11 2025-05-07T20:25:47.2470391Z #define __GXX_RTTI 1 2025-05-07T20:25:47.2470611Z #define __pie__ 2 2025-05-07T20:25:47.2470820Z #define __MMX__ 1 2025-05-07T20:25:47.2471031Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:25:47.2471292Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:25:47.2471564Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:25:47.2471818Z #define __STDC_UTF_16__ 1 2025-05-07T20:25:47.2472068Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:25:47.2472365Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:25:47.2472672Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:25:47.2473131Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:25:47.2473493Z #define __cpp_raw_strings 200710L 2025-05-07T20:25:47.2473790Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:47.2474088Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:25:47.2474338Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:25:47.2474591Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:25:47.2474886Z #define __cpp_fold_expressions 201603L 2025-05-07T20:25:47.2475170Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:25:47.2475425Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:25:47.2475673Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:25:47.2475946Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:25:47.2476228Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:25:47.2476482Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:25:47.2476844Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:25:47.2477093Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:25:47.2477355Z #define __cplusplus 201703L 2025-05-07T20:25:47.2477645Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:25:47.2477939Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:25:47.2478190Z #define __DEPRECATED 1 2025-05-07T20:25:47.2478430Z #define __cpp_rvalue_references 200610L 2025-05-07T20:25:47.2478717Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:25:47.2478966Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:25:47.2479269Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:25:47.2479615Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:25:47.2479875Z #define __SSE2_MATH__ 1 2025-05-07T20:25:47.2480112Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:25:47.2480404Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:47.2480688Z #define __amd64 1 2025-05-07T20:25:47.2480905Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:25:47.2481167Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:25:47.2481435Z #define __GNUG__ 11 2025-05-07T20:25:47.2481685Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:25:47.2481987Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:25:47.2482232Z #define __cpp_nsdmi 200809L 2025-05-07T20:25:47.2482482Z #define __FLT64X_MIN_EXP__ (-16381) 
2025-05-07T20:25:47.2482743Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:25:47.2482990Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:25:47.2483258Z #define __cpp_initializer_lists 200806L 2025-05-07T20:25:47.2483537Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:25:47.2483799Z #define __cpp_hex_float 201603L 2025-05-07T20:25:47.2484058Z #define __GXX_ABI_VERSION 1016 2025-05-07T20:25:47.2484309Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:25:47.2484582Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:25:47.2484845Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:25:47.2485098Z #define __x86_64 1 2025-05-07T20:25:47.2485328Z #define __cpp_lambdas 200907L 2025-05-07T20:25:47.2485587Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:25:47.2485951Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:25:47.2486336Z #define __cpp_template_auto 201606L 2025-05-07T20:25:47.2486688Z #define __DBL_MIN__ double(2.22507385850720138309023271733240406e-308L) 2025-05-07T20:25:47.2487136Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:25:47.2487598Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:25:47.2487970Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:25:47.2488214Z #define __LP64__ 1 2025-05-07T20:25:47.2488430Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:25:47.2488768Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:25:47.2489134Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:25:47.2489398Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:25:47.2489672Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:25:47.2489937Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:25:47.2490200Z #define __REGISTER_PREFIX__ 2025-05-07T20:25:47.2490446Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:25:47.2490792Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:25:47.2491110Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:25:47.2491450Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:25:47.2491719Z #define __FLT_DIG__ 6 2025-05-07T20:25:47.2491945Z #define __NO_INLINE__ 1 2025-05-07T20:25:47.2492172Z #define __DEC_EVAL_METHOD__ 2 2025-05-07T20:25:47.2492533Z #define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:25:47.2492865Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:25:47.2493117Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:25:47.2493374Z #define __VERSION__ "11.4.0" 2025-05-07T20:25:47.2493618Z #define __UINT64_C(c) c ## UL 2025-05-07T20:25:47.2493890Z #define __cpp_unicode_characters 201411L 2025-05-07T20:25:47.2494179Z #define _STDC_PREDEF_H 1 2025-05-07T20:25:47.2494425Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:25:47.2494855Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:25:47.2495136Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:25:47.2495398Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:25:47.2495693Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:25:47.2496023Z #define __cpp_aggregate_bases 201603L 2025-05-07T20:25:47.2496310Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:25:47.2496561Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:25:47.2496814Z #define __FLT128_DIG__ 33 2025-05-07T20:25:47.2497051Z #define __INT32_C(c) c 2025-05-07T20:25:47.2497281Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:25:47.2497560Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:25:47.2497838Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:25:47.2498103Z #define 
__INT_FAST32_TYPE__ long int 2025-05-07T20:25:47.2498410Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:25:47.2498709Z #define unix 1 2025-05-07T20:25:47.2498918Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:25:47.2499181Z #define __cpp_rtti 199711L 2025-05-07T20:25:47.2499440Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:25:47.2499739Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:47.2500045Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:25:47.2500348Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:25:47.2500669Z #define __FLT64X_DIG__ 18 2025-05-07T20:25:47.2500912Z #define __INT8_TYPE__ signed char 2025-05-07T20:25:47.2501196Z #define __cpp_digit_separators 201309L 2025-05-07T20:25:47.2501470Z #define __ELF__ 1 2025-05-07T20:25:47.2501687Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:25:47.2501964Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:25:47.2502232Z #define __FLT_RADIX__ 2 2025-05-07T20:25:47.2502468Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:25:47.2502814Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:25:47.2503166Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:25:47.2503427Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:25:47.2503702Z #define __k8 1 2025-05-07T20:25:47.2503994Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:25:47.2504359Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:25:47.2504644Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:25:47.2504936Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:25:47.2505192Z #define __LDBL_DIG__ 18 2025-05-07T20:25:47.2505424Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:25:47.2505675Z #define __x86_64__ 1 2025-05-07T20:25:47.2505908Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:25:47.2506195Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:25:47.2506522Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:47.2506821Z #define __FLT64_DIG__ 15 2025-05-07T20:25:47.2507092Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:25:47.2507429Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:25:47.2507837Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:25:47.2508103Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:25:47.2508379Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:47.2508671Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:25:47.2509123Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 2025-05-07T20:25:47.2509507Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:25:47.2509791Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:25:47.2510106Z #define __cpp_unicode_literals 200710L 2025-05-07T20:25:47.2510410Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:25:47.2510723Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:25:47.2511010Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:25:47.2511272Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:25:47.2511561Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:25:47.2511830Z #define __SIZE_WIDTH__ 64 2025-05-07T20:25:47.2512056Z #define __SEG_FS 1 2025-05-07T20:25:47.2512274Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:25:47.2512534Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:25:47.2512878Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:25:47.2513149Z #define __SEG_GS 1 2025-05-07T20:25:47.2513456Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 
2025-05-07T20:25:47.2513822Z #define __SIG_ATOMIC_WIDTH__ 32
2025-05-07T20:25:47.2514077Z ... (more hidden) ...
2025-05-07T20:25:47.2559545Z #define __ATOMIC_RELEASE 3
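The dump above is the compiler preprocessor's -dM output, which prints every predefined macro instead of preprocessed source. A minimal sketch of how to filter such a dump for specific macros, using the same conda run pattern as the checks below (the particular macro names grepped here are illustrative choices, not part of the original workflow):

  # Dump all predefined C++ macros from the build_binary env and keep a few.
  # -dM prints the #define lines; "-x c++ -" reads an empty C++ program from stdin.
  conda run -n build_binary c++ -dM -E -x c++ - < /dev/null \
    | grep -E '__cplusplus|__GNUC|__cpp_constexpr' \
    | sort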
2025-05-07T20:25:47.3043750Z + conda run -n build_binary c++ --version
2025-05-07T20:25:49.1760063Z c++ (conda-forge gcc 11.4.0-13) 11.4.0
2025-05-07T20:25:49.1760465Z Copyright (C) 2021 Free Software Foundation, Inc.
2025-05-07T20:25:49.1760909Z This is free software; see the source for copying conditions. There is NO
2025-05-07T20:25:49.1761436Z warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
2025-05-07T20:25:49.2386421Z [INFO] Printing the default version of the C standard used by the compiler ...
2025-05-07T20:25:49.2388001Z + conda run -n build_binary cc -dM -E - < /dev/null | grep __STDC_VERSION__
2025-05-07T20:25:51.1879556Z #define __STDC_VERSION__ 201710L
2025-05-07T20:25:51.1882561Z [INFO] Printing the default version of the C++ standard used by the compiler ...
2025-05-07T20:25:51.1883134Z + conda run -n build_binary c++ -dM -E -x c++ - < /dev/null | grep __cplusplus
2025-05-07T20:25:53.1323649Z #define __cplusplus 201703L
2025-05-07T20:25:53.1326524Z [INSTALL] Successfully installed C/C++ compilers
2025-05-07T20:25:53.1374930Z ##[group]Run . $PRELUDE; install_cuda $BUILD_ENV 12.8.0
2025-05-07T20:25:53.1375336Z . $PRELUDE; install_cuda $BUILD_ENV 12.8.0
2025-05-07T20:25:53.1388998Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:25:53.1389350Z env:
2025-05-07T20:25:53.1389570Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:25:53.1389876Z   BUILD_ENV: build_binary
2025-05-07T20:25:53.1390123Z   BUILD_TARGET: genai
2025-05-07T20:25:53.1390356Z   BUILD_VARIANT: cuda
2025-05-07T20:25:53.1390581Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:25:53.1390835Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:25:53.1391131Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:25:53.1391464Z ##[endgroup]
2025-05-07T20:25:53.4747276Z ################################################################################
2025-05-07T20:25:53.4747850Z # Install CUDA
2025-05-07T20:25:53.4748121Z #
2025-05-07T20:25:53.4763647Z # [2025-05-07T20:25:53.476Z] + install_cuda build_binary 12.8.0
2025-05-07T20:25:53.4764163Z ################################################################################
2025-05-07T20:25:53.4779216Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:25:53.5685555Z [CHECK] Network does not appear to be blocked.
2025-05-07T20:25:53.5686037Z [SETUP] Cleaning up Conda packages ...
2025-05-07T20:25:53.5690174Z + conda clean --packages --tarball -y
2025-05-07T20:25:54.2702101Z Will remove 29 (113.6 MB) tarball(s).
2025-05-07T20:25:54.2702627Z Will remove 6 (619 KB) package(s).
2025-05-07T20:25:54.3332948Z + conda clean --all -y
2025-05-07T20:25:54.9934532Z There are no unused tarball(s) to remove.
2025-05-07T20:25:54.9935162Z Will remove 1 index cache(s).
2025-05-07T20:25:54.9935721Z There are no unused package(s) to remove.
2025-05-07T20:25:54.9936322Z There are no tempfile(s) to remove.
2025-05-07T20:25:54.9936891Z There are no logfile(s) to remove.
2025-05-07T20:25:55.0572122Z [INSTALL] Installing CUDA 12.8.0 ...
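The [EXEC] [ATTEMPT 0/3] prefix on the wget probe above (and on the conda install below) shows that the prelude routes commands through a retry helper defined in .github/scripts/setup_env.bash. That helper is not shown in this log; a minimal sketch of the pattern, with a hypothetical name and a simple exponential backoff, assuming up to three retries as the log suggests:

  # Hypothetical retry wrapper mirroring the [EXEC] [ATTEMPT i/N] log lines;
  # the real implementation lives in the prelude script and may differ.
  exec_with_retries () {
    local max_retries=3
    for ((i = 0; i <= max_retries; i++)); do
      echo "[EXEC] [ATTEMPT ${i}/${max_retries}] + $*"
      "$@" && return 0        # first success wins
      sleep $(( 2 ** i ))     # back off before retrying
    done
    echo "[ERROR] Command failed after ${max_retries} retries: $*"
    return 1
  }

  exec_with_retries conda install --force-reinstall -n build_binary \
    -c conda-forge --override-channels -y cuda=12.8.0

Note that --override-channels restricts the solve to conda-forge alone, which is why several pkgs/main packages end up SUPERSEDED by conda-forge builds in the plan below.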
2025-05-07T20:25:55.0596566Z [EXEC] [ATTEMPT 0/3] + conda install --force-reinstall -n build_binary -c conda-forge --override-channels -y cuda=12.8.0
2025-05-07T20:25:55.9645870Z Channels:
2025-05-07T20:26:06.3348880Z  - conda-forge
2025-05-07T20:26:06.3349166Z Platform: linux-64
2025-05-07T20:26:06.3349908Z Collecting package metadata (repodata.json): done
2025-05-07T20:26:07.4486670Z Solving environment: done
2025-05-07T20:26:07.5217217Z ## Package Plan ##
2025-05-07T20:26:07.5217623Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:26:07.5218015Z   added / updated specs:
2025-05-07T20:26:07.5218245Z     - cuda=12.8.0
2025-05-07T20:26:07.5218525Z The following packages will be downloaded:
2025-05-07T20:26:07.5218847Z     package                             |            build
2025-05-07T20:26:07.5219160Z     ------------------------------------|-----------------
    alsa-lib-1.2.14                     | hb9d3cd8_0        553 KB  conda-forge
    attr-2.5.1                          | h166bdaf_1         69 KB  conda-forge
    binutils-2.40                       | h4852527_7         31 KB  conda-forge
    c-compiler-1.5.2                    | h0b41bf4_0          6 KB  conda-forge
    cuda-12.8.0                         | ha804496_0         26 KB  conda-forge
    cuda-cccl_linux-64-12.8.55          | ha770c72_1        1.0 MB  conda-forge
    cuda-command-line-tools-12.8.0      | ha770c72_0         20 KB  conda-forge
    cuda-compiler-12.8.0                | hbad6d8a_0         20 KB  conda-forge
    cuda-crt-dev_linux-64-12.8.61       | ha770c72_1         90 KB  conda-forge
    cuda-crt-tools-12.8.61              | ha770c72_1         27 KB  conda-forge
    cuda-cudart-12.8.57                 | h5888daf_1         22 KB  conda-forge
    cuda-cudart-dev-12.8.57             | h5888daf_1         23 KB  conda-forge
    cuda-cudart-dev_linux-64-12.8.57    | h3f2d84a_1        377 KB  conda-forge
    cuda-cudart-static-12.8.57          | h5888daf_1         22 KB  conda-forge
    cuda-cudart-static_linux-64-12.8.57 | h3f2d84a_1        950 KB  conda-forge
    cuda-cudart_linux-64-12.8.57        | h3f2d84a_1        188 KB  conda-forge
    cuda-cuobjdump-12.8.55              | hbd13f7d_0        227 KB  conda-forge
    cuda-cupti-12.8.57                  | hbd13f7d_0        1.8 MB  conda-forge
    cuda-cupti-dev-12.8.57              | h5888daf_0        4.0 MB  conda-forge
    cuda-cuxxfilt-12.8.55               | hbd13f7d_0        211 KB  conda-forge
    cuda-driver-dev-12.8.57             | h5888daf_1         22 KB  conda-forge
    cuda-driver-dev_linux-64-12.8.90    | h3f2d84a_1         36 KB  conda-forge
    cuda-gdb-12.8.55                    | h50b4baa_0        353 KB  conda-forge
    cuda-libraries-12.8.0               | ha770c72_0         20 KB  conda-forge
    cuda-libraries-dev-12.8.0           | ha770c72_0         20 KB  conda-forge
    cuda-nsight-12.8.55                 | h7938cbb_0      113.2 MB  conda-forge
    cuda-nvcc-12.8.61                   | hcdd1206_0         23 KB  conda-forge
    cuda-nvcc-dev_linux-64-12.8.61      | he91c749_1       12.7 MB  conda-forge
    cuda-nvcc-impl-12.8.61              | h85509e4_1         25 KB  conda-forge
    cuda-nvcc-tools-12.8.61             | he02047a_1       24.5
MB conda-forge 2025-05-07T20:26:07.5234436Z cuda-nvcc_linux-64-12.8.61 | h04802cd_0 25 KB conda-forge 2025-05-07T20:26:07.5234879Z cuda-nvdisasm-12.8.55 | hbd13f7d_0 4.9 MB conda-forge 2025-05-07T20:26:07.5235315Z cuda-nvml-dev-12.8.55 | hbd13f7d_0 134 KB conda-forge 2025-05-07T20:26:07.5235738Z cuda-nvprof-12.8.57 | hbd13f7d_0 2.5 MB conda-forge 2025-05-07T20:26:07.5236170Z cuda-nvprune-12.8.55 | hbd13f7d_0 68 KB conda-forge 2025-05-07T20:26:07.5236602Z cuda-nvrtc-12.8.61 | hbd13f7d_0 63.1 MB conda-forge 2025-05-07T20:26:07.5237025Z cuda-nvrtc-dev-12.8.61 | h5888daf_0 34 KB conda-forge 2025-05-07T20:26:07.5237458Z cuda-nvtx-12.8.55 | hbd13f7d_0 31 KB conda-forge 2025-05-07T20:26:07.5237897Z cuda-nvvm-dev_linux-64-12.8.61| ha770c72_1 25 KB conda-forge 2025-05-07T20:26:07.5238447Z cuda-nvvm-impl-12.8.61 | he02047a_1 20.8 MB conda-forge 2025-05-07T20:26:07.5238883Z cuda-nvvm-tools-12.8.61 | he02047a_1 23.5 MB conda-forge 2025-05-07T20:26:07.5239307Z cuda-nvvp-12.8.57 | hbd13f7d_0 112.4 MB conda-forge 2025-05-07T20:26:07.5239724Z cuda-opencl-12.8.55 | hbd13f7d_0 29 KB conda-forge 2025-05-07T20:26:07.5240551Z cuda-opencl-dev-12.8.55 | h5888daf_0 95 KB conda-forge 2025-05-07T20:26:07.5241199Z cuda-profiler-api-12.8.55 | h7938cbb_0 22 KB conda-forge 2025-05-07T20:26:07.5241732Z cuda-runtime-12.8.0 | ha804496_0 20 KB conda-forge 2025-05-07T20:26:07.5242268Z cuda-sanitizer-api-12.8.55 | hbd13f7d_0 8.8 MB conda-forge 2025-05-07T20:26:07.5242799Z cuda-toolkit-12.8.0 | ha804496_0 20 KB conda-forge 2025-05-07T20:26:07.5243283Z cuda-tools-12.8.0 | ha770c72_0 19 KB conda-forge 2025-05-07T20:26:07.5243777Z cuda-version-12.8 | h5d125a7_3 21 KB conda-forge 2025-05-07T20:26:07.5244298Z cuda-visual-tools-12.8.0 | ha770c72_0 20 KB conda-forge 2025-05-07T20:26:07.5244820Z cxx-compiler-1.5.2 | hf52228f_0 6 KB conda-forge 2025-05-07T20:26:07.5245276Z dbus-1.13.6 | h5008d03_3 604 KB conda-forge 2025-05-07T20:26:07.5245708Z expat-2.7.0 | h5888daf_0 137 KB conda-forge 2025-05-07T20:26:07.5246242Z font-ttf-dejavu-sans-mono-2.37| hab24e00_0 388 KB conda-forge 2025-05-07T20:26:07.5246836Z font-ttf-inconsolata-3.000 | h77eed37_0 94 KB conda-forge 2025-05-07T20:26:07.5247423Z font-ttf-source-code-pro-2.038| h77eed37_0 684 KB conda-forge 2025-05-07T20:26:07.5247988Z font-ttf-ubuntu-0.83 | h77eed37_3 1.5 MB conda-forge 2025-05-07T20:26:07.5248499Z fontconfig-2.15.0 | h7e30c49_1 259 KB conda-forge 2025-05-07T20:26:07.5249022Z fonts-conda-ecosystem-1 | 0 4 KB conda-forge 2025-05-07T20:26:07.5249559Z fonts-conda-forge-1 | 0 4 KB conda-forge 2025-05-07T20:26:07.5250056Z freetype-2.13.3 | ha770c72_1 168 KB conda-forge 2025-05-07T20:26:07.5250502Z gcc-11.4.0 | h602e360_13 49 KB conda-forge 2025-05-07T20:26:07.5250945Z gds-tools-1.13.0.11 | h5888daf_0 37.9 MB conda-forge 2025-05-07T20:26:07.5251399Z gmp-6.3.0 | hac33072_2 449 KB conda-forge 2025-05-07T20:26:07.5251814Z gxx-11.4.0 | h602e360_13 49 KB conda-forge 2025-05-07T20:26:07.5252249Z keyutils-1.6.1 | h166bdaf_0 115 KB conda-forge 2025-05-07T20:26:07.5252693Z krb5-1.21.3 | h659f571_0 1.3 MB conda-forge 2025-05-07T20:26:07.5253133Z libcap-2.71 | h39aace5_0 100 KB conda-forge 2025-05-07T20:26:07.5253602Z libcublas-12.8.3.14 | h9ab20c4_0 460.2 MB conda-forge 2025-05-07T20:26:07.5254105Z libcublas-dev-12.8.3.14 | h9ab20c4_0 89 KB conda-forge 2025-05-07T20:26:07.5254608Z libcufft-11.3.3.41 | hbd13f7d_0 147.4 MB conda-forge 2025-05-07T20:26:07.5255106Z libcufft-dev-11.3.3.41 | h5888daf_0 33 KB conda-forge 2025-05-07T20:26:07.5255604Z libcufile-1.13.0.11 | h12f29b5_0 939 KB 
conda-forge 2025-05-07T20:26:07.5256114Z libcufile-dev-1.13.0.11 | h5888daf_0 35 KB conda-forge 2025-05-07T20:26:07.5256621Z libcurand-10.3.9.55 | hbd13f7d_0 43.6 MB conda-forge 2025-05-07T20:26:07.5257127Z libcurand-dev-10.3.9.55 | h5888daf_0 265 KB conda-forge 2025-05-07T20:26:07.5257769Z libcusolver-11.7.2.55 | h9ab20c4_0 156.9 MB conda-forge 2025-05-07T20:26:07.5258303Z libcusolver-dev-11.7.2.55 | h9ab20c4_0 59 KB conda-forge 2025-05-07T20:26:07.5258832Z libcusparse-12.5.7.53 | hbd13f7d_0 164.9 MB conda-forge 2025-05-07T20:26:07.5259360Z libcusparse-dev-12.5.7.53 | h5888daf_0 51 KB conda-forge 2025-05-07T20:26:07.5259896Z libedit-3.1.20250104 | pl5321h7949ede_0 132 KB conda-forge 2025-05-07T20:26:07.5260397Z libexpat-2.7.0 | h5888daf_0 73 KB conda-forge 2025-05-07T20:26:07.5260969Z libfreetype-2.13.3 | ha770c72_1 8 KB conda-forge 2025-05-07T20:26:07.5261473Z libfreetype6-2.13.3 | h48d6fc4_1 371 KB conda-forge 2025-05-07T20:26:07.5261984Z libgcrypt-lib-1.11.0 | hb9d3cd8_2 572 KB conda-forge 2025-05-07T20:26:07.5262480Z libglib-2.84.0 | h2ff4ddf_0 3.8 MB conda-forge 2025-05-07T20:26:07.5262949Z libglvnd-1.7.0 | ha4b6fd6_2 129 KB conda-forge 2025-05-07T20:26:07.5263427Z libgpg-error-1.55 | h3f2d84a_0 305 KB conda-forge 2025-05-07T20:26:07.5263908Z libiconv-1.18 | h4ce23a2_1 696 KB conda-forge 2025-05-07T20:26:07.5264364Z libnl-3.11.0 | hb9d3cd8_0 724 KB conda-forge 2025-05-07T20:26:07.5264813Z libnpp-12.3.3.65 | hbd13f7d_0 130.6 MB conda-forge 2025-05-07T20:26:07.5265300Z libnpp-dev-12.3.3.65 | h5888daf_0 443 KB conda-forge 2025-05-07T20:26:07.5265780Z libnuma-2.0.18 | h4ab18f5_2 42 KB conda-forge 2025-05-07T20:26:07.5266266Z libnvfatbin-12.8.55 | hbd13f7d_0 793 KB conda-forge 2025-05-07T20:26:07.5266786Z libnvfatbin-dev-12.8.55 | h5888daf_0 26 KB conda-forge 2025-05-07T20:26:07.5267323Z libnvjitlink-12.8.61 | hbd13f7d_0 28.7 MB conda-forge 2025-05-07T20:26:07.5267926Z libnvjitlink-dev-12.8.61 | h5888daf_0 25 KB conda-forge 2025-05-07T20:26:07.5268361Z libnvjpeg-12.3.5.57 | h97fd463_0 3.0 MB conda-forge 2025-05-07T20:26:07.5268793Z libnvjpeg-dev-12.3.5.57 | ha770c72_0 31 KB conda-forge 2025-05-07T20:26:07.5269219Z libopengl-1.7.0 | ha4b6fd6_2 50 KB conda-forge 2025-05-07T20:26:07.5269622Z libpng-1.6.47 | h943b412_0 282 KB conda-forge 2025-05-07T20:26:07.5270021Z libsqlite-3.49.2 | hee588c1_0 895 KB conda-forge 2025-05-07T20:26:07.5270440Z libsystemd0-256.9 | h2774228_0 401 KB conda-forge 2025-05-07T20:26:07.5270854Z libudev1-257.4 | h9a4d06a_0 140 KB conda-forge 2025-05-07T20:26:07.5271255Z libuuid-2.38.1 | h0b41bf4_0 33 KB conda-forge 2025-05-07T20:26:07.5271641Z libxcb-1.17.0 | h8a09558_0 387 KB conda-forge 2025-05-07T20:26:07.5272049Z libxkbcommon-1.8.0 | hc4a0caf_0 627 KB conda-forge 2025-05-07T20:26:07.5272468Z libxkbfile-1.1.0 | h166bdaf_1 111 KB conda-forge 2025-05-07T20:26:07.5272874Z libxml2-2.13.5 | h064dc61_0 673 KB conda-forge 2025-05-07T20:26:07.5273263Z libzlib-1.3.1 | hb9d3cd8_2 60 KB conda-forge 2025-05-07T20:26:07.5273660Z lz4-c-1.9.4 | hcb278e6_0 140 KB conda-forge 2025-05-07T20:26:07.5274031Z ncurses-6.5 | h2d0b736_3 871 KB conda-forge 2025-05-07T20:26:07.5274462Z nsight-compute-2025.1.0.14 | hb5ebaad_0 320.6 MB conda-forge 2025-05-07T20:26:07.5274883Z nspr-4.36 | h5888daf_0 225 KB conda-forge 2025-05-07T20:26:07.5275252Z nss-3.111 | h159eef7_0 1.9 MB conda-forge 2025-05-07T20:26:07.5275709Z ocl-icd-2.3.3 | hb9d3cd8_0 104 KB conda-forge 2025-05-07T20:26:07.5276144Z opencl-headers-2024.10.24 | h5888daf_0 53 KB conda-forge 2025-05-07T20:26:07.5276567Z pcre2-10.44 | 
hc749103_2 934 KB conda-forge 2025-05-07T20:26:07.5276976Z pthread-stubs-0.4 | hb9d3cd8_1002 8 KB conda-forge 2025-05-07T20:26:07.5277407Z python-3.13.0 |h9ebbce0_101_cp313 31.5 MB conda-forge 2025-05-07T20:26:07.5277923Z rdma-core-55.0 | h5888daf_0 1.2 MB conda-forge 2025-05-07T20:26:07.5278322Z sqlite-3.49.2 | h9eae976_0 840 KB conda-forge 2025-05-07T20:26:07.5278695Z tk-8.6.13 |noxft_h4845f30_101 3.2 MB conda-forge 2025-05-07T20:26:07.5279082Z wayland-1.23.1 | h3e06ad9_0 314 KB conda-forge 2025-05-07T20:26:07.5279479Z xcb-util-0.4.1 | hb711507_2 19 KB conda-forge 2025-05-07T20:26:07.5279895Z xcb-util-cursor-0.1.5 | hb9d3cd8_0 20 KB conda-forge 2025-05-07T20:26:07.5280337Z xcb-util-image-0.4.0 | hb711507_2 24 KB conda-forge 2025-05-07T20:26:07.5280793Z xcb-util-keysyms-0.4.1 | hb711507_0 14 KB conda-forge 2025-05-07T20:26:07.5281253Z xcb-util-renderutil-0.3.10 | hb711507_0 17 KB conda-forge 2025-05-07T20:26:07.5281698Z xcb-util-wm-0.4.2 | hb711507_0 50 KB conda-forge 2025-05-07T20:26:07.5282136Z xkeyboard-config-2.44 | hb9d3cd8_0 384 KB conda-forge 2025-05-07T20:26:07.5282575Z xorg-libice-1.1.2 | hb9d3cd8_0 57 KB conda-forge 2025-05-07T20:26:07.5282997Z xorg-libsm-1.2.6 | he73a12e_0 27 KB conda-forge 2025-05-07T20:26:07.5283416Z xorg-libx11-1.8.12 | h4f16b4b_0 816 KB conda-forge 2025-05-07T20:26:07.5283836Z xorg-libxau-1.0.12 | hb9d3cd8_0 14 KB conda-forge 2025-05-07T20:26:07.5284286Z xorg-libxcomposite-0.4.6 | hb9d3cd8_2 13 KB conda-forge 2025-05-07T20:26:07.5284749Z xorg-libxdamage-1.1.6 | hb9d3cd8_0 13 KB conda-forge 2025-05-07T20:26:07.5285197Z xorg-libxdmcp-1.1.5 | hb9d3cd8_0 19 KB conda-forge 2025-05-07T20:26:07.5285641Z xorg-libxext-1.3.6 | hb9d3cd8_0 49 KB conda-forge 2025-05-07T20:26:07.5286095Z xorg-libxfixes-6.0.1 | hb9d3cd8_0 19 KB conda-forge 2025-05-07T20:26:07.5286514Z xorg-libxi-1.8.2 | hb9d3cd8_0 46 KB conda-forge 2025-05-07T20:26:07.5286950Z xorg-libxrandr-1.5.4 | hb9d3cd8_0 29 KB conda-forge 2025-05-07T20:26:07.5287399Z xorg-libxrender-0.9.12 | hb9d3cd8_0 32 KB conda-forge 2025-05-07T20:26:07.5287845Z xorg-libxtst-1.2.5 | hb9d3cd8_3 32 KB conda-forge 2025-05-07T20:26:07.5288245Z zlib-1.3.1 | hb9d3cd8_2 90 KB conda-forge 2025-05-07T20:26:07.5288620Z zstd-1.5.7 | hb8e6e7a_2 554 KB conda-forge 2025-05-07T20:26:07.5288983Z ------------------------------------------------------------ 2025-05-07T20:26:07.5289311Z Total: 1.91 GB 2025-05-07T20:26:07.5289520Z 2025-05-07T20:26:07.5289644Z The following NEW packages will be INSTALLED: 2025-05-07T20:26:07.5289863Z 2025-05-07T20:26:07.5290069Z alsa-lib conda-forge/linux-64::alsa-lib-1.2.14-hb9d3cd8_0 2025-05-07T20:26:07.5290477Z attr conda-forge/linux-64::attr-2.5.1-h166bdaf_1 2025-05-07T20:26:07.5290881Z binutils conda-forge/linux-64::binutils-2.40-h4852527_7 2025-05-07T20:26:07.5291330Z c-compiler conda-forge/linux-64::c-compiler-1.5.2-h0b41bf4_0 2025-05-07T20:26:07.5291833Z cuda conda-forge/noarch::cuda-12.8.0-ha804496_0 2025-05-07T20:26:07.5292292Z cuda-cccl_linux-64 conda-forge/noarch::cuda-cccl_linux-64-12.8.55-ha770c72_1 2025-05-07T20:26:07.5292864Z cuda-command-line~ conda-forge/linux-64::cuda-command-line-tools-12.8.0-ha770c72_0 2025-05-07T20:26:07.5293423Z cuda-compiler conda-forge/noarch::cuda-compiler-12.8.0-hbad6d8a_0 2025-05-07T20:26:07.5293950Z cuda-crt-dev_linu~ conda-forge/noarch::cuda-crt-dev_linux-64-12.8.61-ha770c72_1 2025-05-07T20:26:07.5294498Z cuda-crt-tools conda-forge/linux-64::cuda-crt-tools-12.8.61-ha770c72_1 2025-05-07T20:26:07.5295094Z cuda-cudart conda-forge/linux-64::cuda-cudart-12.8.57-h5888daf_1 
2025-05-07T20:26:07.5295612Z cuda-cudart-dev conda-forge/linux-64::cuda-cudart-dev-12.8.57-h5888daf_1 2025-05-07T20:26:07.5296166Z cuda-cudart-dev_l~ conda-forge/noarch::cuda-cudart-dev_linux-64-12.8.57-h3f2d84a_1 2025-05-07T20:26:07.5296765Z cuda-cudart-static conda-forge/linux-64::cuda-cudart-static-12.8.57-h5888daf_1 2025-05-07T20:26:07.5299880Z cuda-cudart-stati~ conda-forge/noarch::cuda-cudart-static_linux-64-12.8.57-h3f2d84a_1 2025-05-07T20:26:07.5300481Z cuda-cudart_linux~ conda-forge/noarch::cuda-cudart_linux-64-12.8.57-h3f2d84a_1 2025-05-07T20:26:07.5301026Z cuda-cuobjdump conda-forge/linux-64::cuda-cuobjdump-12.8.55-hbd13f7d_0 2025-05-07T20:26:07.5301538Z cuda-cupti conda-forge/linux-64::cuda-cupti-12.8.57-hbd13f7d_0 2025-05-07T20:26:07.5302018Z cuda-cupti-dev conda-forge/linux-64::cuda-cupti-dev-12.8.57-h5888daf_0 2025-05-07T20:26:07.5302543Z cuda-cuxxfilt conda-forge/linux-64::cuda-cuxxfilt-12.8.55-hbd13f7d_0 2025-05-07T20:26:07.5303060Z cuda-driver-dev conda-forge/linux-64::cuda-driver-dev-12.8.57-h5888daf_1 2025-05-07T20:26:07.5303612Z cuda-driver-dev_l~ conda-forge/noarch::cuda-driver-dev_linux-64-12.8.90-h3f2d84a_1 2025-05-07T20:26:07.5304132Z cuda-gdb conda-forge/linux-64::cuda-gdb-12.8.55-h50b4baa_0 2025-05-07T20:26:07.5304621Z cuda-libraries conda-forge/linux-64::cuda-libraries-12.8.0-ha770c72_0 2025-05-07T20:26:07.5305166Z cuda-libraries-dev conda-forge/linux-64::cuda-libraries-dev-12.8.0-ha770c72_0 2025-05-07T20:26:07.5305694Z cuda-nsight conda-forge/linux-64::cuda-nsight-12.8.55-h7938cbb_0 2025-05-07T20:26:07.5306150Z cuda-nvcc conda-forge/linux-64::cuda-nvcc-12.8.61-hcdd1206_0 2025-05-07T20:26:07.5306661Z cuda-nvcc-dev_lin~ conda-forge/noarch::cuda-nvcc-dev_linux-64-12.8.61-he91c749_1 2025-05-07T20:26:07.5307220Z cuda-nvcc-impl conda-forge/linux-64::cuda-nvcc-impl-12.8.61-h85509e4_1 2025-05-07T20:26:07.5307850Z cuda-nvcc-tools conda-forge/linux-64::cuda-nvcc-tools-12.8.61-he02047a_1 2025-05-07T20:26:07.5308384Z cuda-nvcc_linux-64 conda-forge/linux-64::cuda-nvcc_linux-64-12.8.61-h04802cd_0 2025-05-07T20:26:07.5308908Z cuda-nvdisasm conda-forge/linux-64::cuda-nvdisasm-12.8.55-hbd13f7d_0 2025-05-07T20:26:07.5309411Z cuda-nvml-dev conda-forge/linux-64::cuda-nvml-dev-12.8.55-hbd13f7d_0 2025-05-07T20:26:07.5309901Z cuda-nvprof conda-forge/linux-64::cuda-nvprof-12.8.57-hbd13f7d_0 2025-05-07T20:26:07.5310385Z cuda-nvprune conda-forge/linux-64::cuda-nvprune-12.8.55-hbd13f7d_0 2025-05-07T20:26:07.5310859Z cuda-nvrtc conda-forge/linux-64::cuda-nvrtc-12.8.61-hbd13f7d_0 2025-05-07T20:26:07.5311343Z cuda-nvrtc-dev conda-forge/linux-64::cuda-nvrtc-dev-12.8.61-h5888daf_0 2025-05-07T20:26:07.5311829Z cuda-nvtx conda-forge/linux-64::cuda-nvtx-12.8.55-hbd13f7d_0 2025-05-07T20:26:07.5312349Z cuda-nvvm-dev_lin~ conda-forge/noarch::cuda-nvvm-dev_linux-64-12.8.61-ha770c72_1 2025-05-07T20:26:07.5312901Z cuda-nvvm-impl conda-forge/linux-64::cuda-nvvm-impl-12.8.61-he02047a_1 2025-05-07T20:26:07.5313436Z cuda-nvvm-tools conda-forge/linux-64::cuda-nvvm-tools-12.8.61-he02047a_1 2025-05-07T20:26:07.5313926Z cuda-nvvp conda-forge/linux-64::cuda-nvvp-12.8.57-hbd13f7d_0 2025-05-07T20:26:07.5314500Z cuda-opencl conda-forge/linux-64::cuda-opencl-12.8.55-hbd13f7d_0 2025-05-07T20:26:07.5315004Z cuda-opencl-dev conda-forge/linux-64::cuda-opencl-dev-12.8.55-h5888daf_0 2025-05-07T20:26:07.5315550Z cuda-profiler-api conda-forge/linux-64::cuda-profiler-api-12.8.55-h7938cbb_0 2025-05-07T20:26:07.5316065Z cuda-runtime conda-forge/noarch::cuda-runtime-12.8.0-ha804496_0 2025-05-07T20:26:07.5316609Z 
cuda-sanitizer-api conda-forge/linux-64::cuda-sanitizer-api-12.8.55-hbd13f7d_0 2025-05-07T20:26:07.5317142Z cuda-toolkit conda-forge/noarch::cuda-toolkit-12.8.0-ha804496_0 2025-05-07T20:26:07.5317703Z cuda-tools conda-forge/linux-64::cuda-tools-12.8.0-ha770c72_0 2025-05-07T20:26:07.5318170Z cuda-version conda-forge/noarch::cuda-version-12.8-h5d125a7_3 2025-05-07T20:26:07.5318692Z cuda-visual-tools conda-forge/linux-64::cuda-visual-tools-12.8.0-ha770c72_0 2025-05-07T20:26:07.5319220Z cxx-compiler conda-forge/linux-64::cxx-compiler-1.5.2-hf52228f_0 2025-05-07T20:26:07.5319655Z dbus conda-forge/linux-64::dbus-1.13.6-h5008d03_3 2025-05-07T20:26:07.5320144Z font-ttf-dejavu-s~ conda-forge/noarch::font-ttf-dejavu-sans-mono-2.37-hab24e00_0 2025-05-07T20:26:07.5320728Z font-ttf-inconsol~ conda-forge/noarch::font-ttf-inconsolata-3.000-h77eed37_0 2025-05-07T20:26:07.5321312Z font-ttf-source-c~ conda-forge/noarch::font-ttf-source-code-pro-2.038-h77eed37_0 2025-05-07T20:26:07.5321869Z font-ttf-ubuntu conda-forge/noarch::font-ttf-ubuntu-0.83-h77eed37_3 2025-05-07T20:26:07.5322358Z fontconfig conda-forge/linux-64::fontconfig-2.15.0-h7e30c49_1 2025-05-07T20:26:07.5322842Z fonts-conda-ecosy~ conda-forge/noarch::fonts-conda-ecosystem-1-0 2025-05-07T20:26:07.5323315Z fonts-conda-forge conda-forge/noarch::fonts-conda-forge-1-0 2025-05-07T20:26:07.5323759Z freetype conda-forge/linux-64::freetype-2.13.3-ha770c72_1 2025-05-07T20:26:07.5324161Z gcc conda-forge/linux-64::gcc-11.4.0-h602e360_13 2025-05-07T20:26:07.5324575Z gds-tools conda-forge/linux-64::gds-tools-1.13.0.11-h5888daf_0 2025-05-07T20:26:07.5324989Z gmp conda-forge/linux-64::gmp-6.3.0-hac33072_2 2025-05-07T20:26:07.5325347Z gxx conda-forge/linux-64::gxx-11.4.0-h602e360_13 2025-05-07T20:26:07.5325743Z keyutils conda-forge/linux-64::keyutils-1.6.1-h166bdaf_0 2025-05-07T20:26:07.5326155Z krb5 conda-forge/linux-64::krb5-1.21.3-h659f571_0 2025-05-07T20:26:07.5326541Z libcap conda-forge/linux-64::libcap-2.71-h39aace5_0 2025-05-07T20:26:07.5326978Z libcublas conda-forge/linux-64::libcublas-12.8.3.14-h9ab20c4_0 2025-05-07T20:26:07.5327472Z libcublas-dev conda-forge/linux-64::libcublas-dev-12.8.3.14-h9ab20c4_0 2025-05-07T20:26:07.5327954Z libcufft conda-forge/linux-64::libcufft-11.3.3.41-hbd13f7d_0 2025-05-07T20:26:07.5328421Z libcufft-dev conda-forge/linux-64::libcufft-dev-11.3.3.41-h5888daf_0 2025-05-07T20:26:07.5328920Z libcufile conda-forge/linux-64::libcufile-1.13.0.11-h12f29b5_0 2025-05-07T20:26:07.5329410Z libcufile-dev conda-forge/linux-64::libcufile-dev-1.13.0.11-h5888daf_0 2025-05-07T20:26:07.5329895Z libcurand conda-forge/linux-64::libcurand-10.3.9.55-hbd13f7d_0 2025-05-07T20:26:07.5330372Z libcurand-dev conda-forge/linux-64::libcurand-dev-10.3.9.55-h5888daf_0 2025-05-07T20:26:07.5330876Z libcusolver conda-forge/linux-64::libcusolver-11.7.2.55-h9ab20c4_0 2025-05-07T20:26:07.5331397Z libcusolver-dev conda-forge/linux-64::libcusolver-dev-11.7.2.55-h9ab20c4_0 2025-05-07T20:26:07.5331920Z libcusparse conda-forge/linux-64::libcusparse-12.5.7.53-hbd13f7d_0 2025-05-07T20:26:07.5332442Z libcusparse-dev conda-forge/linux-64::libcusparse-dev-12.5.7.53-h5888daf_0 2025-05-07T20:26:07.5332975Z libedit conda-forge/linux-64::libedit-3.1.20250104-pl5321h7949ede_0 2025-05-07T20:26:07.5333561Z libexpat conda-forge/linux-64::libexpat-2.7.0-h5888daf_0 2025-05-07T20:26:07.5334174Z libfreetype conda-forge/linux-64::libfreetype-2.13.3-ha770c72_1 2025-05-07T20:26:07.5334660Z libfreetype6 conda-forge/linux-64::libfreetype6-2.13.3-h48d6fc4_1 2025-05-07T20:26:07.5335173Z 
libgcrypt-lib conda-forge/linux-64::libgcrypt-lib-1.11.0-hb9d3cd8_2 2025-05-07T20:26:07.5335638Z libglib conda-forge/linux-64::libglib-2.84.0-h2ff4ddf_0 2025-05-07T20:26:07.5336065Z libglvnd conda-forge/linux-64::libglvnd-1.7.0-ha4b6fd6_2 2025-05-07T20:26:07.5336528Z libgpg-error conda-forge/linux-64::libgpg-error-1.55-h3f2d84a_0 2025-05-07T20:26:07.5337077Z libiconv conda-forge/linux-64::libiconv-1.18-h4ce23a2_1 2025-05-07T20:26:07.5337493Z libnl conda-forge/linux-64::libnl-3.11.0-hb9d3cd8_0 2025-05-07T20:26:07.5337899Z libnpp conda-forge/linux-64::libnpp-12.3.3.65-hbd13f7d_0 2025-05-07T20:26:07.5338353Z libnpp-dev conda-forge/linux-64::libnpp-dev-12.3.3.65-h5888daf_0 2025-05-07T20:26:07.5338807Z libnuma conda-forge/linux-64::libnuma-2.0.18-h4ab18f5_2 2025-05-07T20:26:07.5339270Z libnvfatbin conda-forge/linux-64::libnvfatbin-12.8.55-hbd13f7d_0 2025-05-07T20:26:07.5339775Z libnvfatbin-dev conda-forge/linux-64::libnvfatbin-dev-12.8.55-h5888daf_0 2025-05-07T20:26:07.5340776Z libnvjitlink conda-forge/linux-64::libnvjitlink-12.8.61-hbd13f7d_0 2025-05-07T20:26:07.5341485Z libnvjitlink-dev conda-forge/linux-64::libnvjitlink-dev-12.8.61-h5888daf_0 2025-05-07T20:26:07.5342009Z libnvjpeg conda-forge/linux-64::libnvjpeg-12.3.5.57-h97fd463_0 2025-05-07T20:26:07.5342495Z libnvjpeg-dev conda-forge/linux-64::libnvjpeg-dev-12.3.5.57-ha770c72_0 2025-05-07T20:26:07.5342983Z libopengl conda-forge/linux-64::libopengl-1.7.0-ha4b6fd6_2 2025-05-07T20:26:07.5343409Z libpng conda-forge/linux-64::libpng-1.6.47-h943b412_0 2025-05-07T20:26:07.5343837Z libsqlite conda-forge/linux-64::libsqlite-3.49.2-hee588c1_0 2025-05-07T20:26:07.5354242Z libsystemd0 conda-forge/linux-64::libsystemd0-256.9-h2774228_0 2025-05-07T20:26:07.5354825Z libudev1 conda-forge/linux-64::libudev1-257.4-h9a4d06a_0 2025-05-07T20:26:07.5355311Z libxcb conda-forge/linux-64::libxcb-1.17.0-h8a09558_0 2025-05-07T20:26:07.5355835Z libxkbcommon conda-forge/linux-64::libxkbcommon-1.8.0-hc4a0caf_0 2025-05-07T20:26:07.5356390Z libxkbfile conda-forge/linux-64::libxkbfile-1.1.0-h166bdaf_1 2025-05-07T20:26:07.5356829Z libxml2 conda-forge/linux-64::libxml2-2.13.5-h064dc61_0 2025-05-07T20:26:07.5357242Z libzlib conda-forge/linux-64::libzlib-1.3.1-hb9d3cd8_2 2025-05-07T20:26:07.5357637Z lz4-c conda-forge/linux-64::lz4-c-1.9.4-hcb278e6_0 2025-05-07T20:26:07.5358103Z nsight-compute conda-forge/linux-64::nsight-compute-2025.1.0.14-hb5ebaad_0 2025-05-07T20:26:07.5358567Z nspr conda-forge/linux-64::nspr-4.36-h5888daf_0 2025-05-07T20:26:07.5358932Z nss conda-forge/linux-64::nss-3.111-h159eef7_0 2025-05-07T20:26:07.5359309Z ocl-icd conda-forge/linux-64::ocl-icd-2.3.3-hb9d3cd8_0 2025-05-07T20:26:07.5359777Z opencl-headers conda-forge/linux-64::opencl-headers-2024.10.24-h5888daf_0 2025-05-07T20:26:07.5360242Z pcre2 conda-forge/linux-64::pcre2-10.44-hc749103_2 2025-05-07T20:26:07.5360694Z pthread-stubs conda-forge/linux-64::pthread-stubs-0.4-hb9d3cd8_1002 2025-05-07T20:26:07.5361157Z rdma-core conda-forge/linux-64::rdma-core-55.0-h5888daf_0 2025-05-07T20:26:07.5362166Z wayland conda-forge/linux-64::wayland-1.23.1-h3e06ad9_0 2025-05-07T20:26:07.5362589Z xcb-util conda-forge/linux-64::xcb-util-0.4.1-hb711507_2 2025-05-07T20:26:07.5363115Z xcb-util-cursor conda-forge/linux-64::xcb-util-cursor-0.1.5-hb9d3cd8_0 2025-05-07T20:26:07.5363630Z xcb-util-image conda-forge/linux-64::xcb-util-image-0.4.0-hb711507_2 2025-05-07T20:26:07.5364392Z xcb-util-keysyms conda-forge/linux-64::xcb-util-keysyms-0.4.1-hb711507_0 2025-05-07T20:26:07.5364958Z xcb-util-renderut~ 
conda-forge/linux-64::xcb-util-renderutil-0.3.10-hb711507_0
  xcb-util-wm         conda-forge/linux-64::xcb-util-wm-0.4.2-hb711507_0
  xkeyboard-config    conda-forge/linux-64::xkeyboard-config-2.44-hb9d3cd8_0
  xorg-libice         conda-forge/linux-64::xorg-libice-1.1.2-hb9d3cd8_0
  xorg-libsm          conda-forge/linux-64::xorg-libsm-1.2.6-he73a12e_0
  xorg-libx11         conda-forge/linux-64::xorg-libx11-1.8.12-h4f16b4b_0
  xorg-libxau         conda-forge/linux-64::xorg-libxau-1.0.12-hb9d3cd8_0
  xorg-libxcomposite  conda-forge/linux-64::xorg-libxcomposite-0.4.6-hb9d3cd8_2
  xorg-libxdamage     conda-forge/linux-64::xorg-libxdamage-1.1.6-hb9d3cd8_0
  xorg-libxdmcp       conda-forge/linux-64::xorg-libxdmcp-1.1.5-hb9d3cd8_0
  xorg-libxext        conda-forge/linux-64::xorg-libxext-1.3.6-hb9d3cd8_0
  xorg-libxfixes      conda-forge/linux-64::xorg-libxfixes-6.0.1-hb9d3cd8_0
  xorg-libxi          conda-forge/linux-64::xorg-libxi-1.8.2-hb9d3cd8_0
  xorg-libxrandr      conda-forge/linux-64::xorg-libxrandr-1.5.4-hb9d3cd8_0
  xorg-libxrender     conda-forge/linux-64::xorg-libxrender-0.9.12-hb9d3cd8_0
  xorg-libxtst        conda-forge/linux-64::xorg-libxtst-1.2.5-hb9d3cd8_3
  zstd                conda-forge/linux-64::zstd-1.5.7-hb8e6e7a_2

2025-05-07T20:26:07.5373842Z The following packages will be UPDATED:
  libuuid   pkgs/main::libuuid-1.41.5-h5eee18b_0 --> conda-forge::libuuid-2.38.1-h0b41bf4_0
  ncurses   pkgs/main::ncurses-6.4-h6a678d5_0 --> conda-forge::ncurses-6.5-h2d0b736_3
  sqlite    pkgs/main::sqlite-3.45.3-h5eee18b_0 --> conda-forge::sqlite-3.49.2-h9eae976_0
  zlib      pkgs/main::zlib-1.2.13-h5eee18b_1 --> conda-forge::zlib-1.3.1-hb9d3cd8_2

2025-05-07T20:26:07.5376599Z The following packages will be SUPERSEDED by a higher-priority channel:
  expat   pkgs/main::expat-2.7.1-h6a678d5_0 --> conda-forge::expat-2.7.0-h5888daf_0
  python  pkgs/main::python-3.13.2-hf623796_100~ --> conda-forge::python-3.13.0-h9ebbce0_101_cp313
  tk      pkgs/main::tk-8.6.14-h39e8969_0 --> conda-forge::tk-8.6.13-noxft_h4845f30_101

2025-05-07T20:26:07.5378841Z Downloading and Extracting Packages: ...working...
2025-05-07T20:26:07.5379211Z libcublas-12.8.3.14  | 460.2 MB | | 0%
2025-05-07T20:26:07.5379843Z nsight-compute-2025. | 320.6 MB | | 0%
2025-05-07T20:26:07.5380318Z libcusparse-12.5.7.5 | 164.9 MB | | 0%
2025-05-07T20:26:07.5380814Z libcusolver-11.7.2.5 | 156.9 MB | | 0%
2025-05-07T20:26:07.5381306Z libcufft-11.3.3.41   | 147.4 MB | | 0%
2025-05-07T20:26:07.5386267Z libnpp-12.3.3.65     | 130.6 MB | | 0%
2025-05-07T20:26:07.5391851Z cuda-nsight-12.8.55  | 113.2 MB | | 0%
2025-05-07T20:26:07.5393526Z cuda-nvvp-12.8.57    | 112.4 MB | | 0%
2025-05-07T20:26:07.5394664Z cuda-nvrtc-12.8.61   | 63.1 MB  | | 0%
2025-05-07T20:26:07.5396799Z libcurand-10.3.9.55  | 43.6 MB  | | 0%
2025-05-07T20:26:07.5400223Z gds-tools-1.13.0.11  | 37.9 MB  | | 0%
2025-05-07T20:26:07.5402945Z python-3.13.0        | 31.5 MB  | | 0%
2025-05-07T20:26:07.5405932Z libnvjitlink-12.8.61 | 28.7 MB  | | 0%
2025-05-07T20:26:07.5407383Z cuda-nvcc-tools-12.8 | 24.5 MB  | | 0%
2025-05-07T20:26:07.5409619Z cuda-nvvm-tools-12.8 | 23.5 MB  | | 0%
2025-05-07T20:26:07.5410816Z cuda-nvvm-impl-12.8. | 20.8 MB  | | 0%
2025-05-07T20:26:07.5412237Z cuda-nvcc-dev_linux- | 12.7 MB  | | 0%
2025-05-07T20:26:07.5414018Z cuda-sanitizer-api-1 | 8.8 MB   | | 0%
2025-05-07T20:26:07.5415525Z cuda-nvdisasm-12.8.5 | 4.9 MB   | | 0%
2025-05-07T20:26:07.6312440Z ... (more hidden) ...
| 320.6 MB | ###2 | 33%  2025-05-07T20:26:10.7082779Z 2025-05-07T20:26:10.7082784Z 2025-05-07T20:26:10.7317096Z libcusparse-12.5.7.5 | 164.9 MB | ######9 | 69%  2025-05-07T20:26:10.7460794Z libcublas-12.8.3.14 | 460.2 MB | ##3 | 24% 2025-05-07T20:26:10.7461119Z 2025-05-07T20:26:10.7461123Z 2025-05-07T20:26:10.7462817Z 2025-05-07T20:26:10.7499940Z libcusolver-11.7.2.5 | 156.9 MB | ######2 | 63%  2025-05-07T20:26:10.7500220Z 2025-05-07T20:26:10.8082641Z nsight-compute-2025. | 320.6 MB | ###3 | 34%  2025-05-07T20:26:10.8082925Z 2025-05-07T20:26:10.8082929Z 2025-05-07T20:26:10.8189540Z libcusparse-12.5.7.5 | 164.9 MB | #######1 | 72%  2025-05-07T20:26:10.8190408Z 2025-05-07T20:26:10.8190416Z 2025-05-07T20:26:10.8190421Z 2025-05-07T20:26:10.8190426Z 2025-05-07T20:26:10.8322031Z libcufft-11.3.3.41 | 147.4 MB | ######5 | 65%  2025-05-07T20:26:10.8464825Z libcublas-12.8.3.14 | 460.2 MB | ##4 | 25% 2025-05-07T20:26:10.8465189Z 2025-05-07T20:26:10.8465196Z 2025-05-07T20:26:10.8467208Z 2025-05-07T20:26:10.8500710Z libcusolver-11.7.2.5 | 156.9 MB | ######5 | 65%  2025-05-07T20:26:10.8501745Z 2025-05-07T20:26:10.9113408Z nsight-compute-2025. | 320.6 MB | ###5 | 35%  2025-05-07T20:26:10.9113709Z 2025-05-07T20:26:10.9114122Z 2025-05-07T20:26:10.9258633Z libcusparse-12.5.7.5 | 164.9 MB | #######4 | 74%  2025-05-07T20:26:10.9258985Z 2025-05-07T20:26:10.9258990Z 2025-05-07T20:26:10.9258994Z 2025-05-07T20:26:10.9258997Z 2025-05-07T20:26:10.9324118Z libcufft-11.3.3.41 | 147.4 MB | ######6 | 67%  2025-05-07T20:26:10.9496033Z libcublas-12.8.3.14 | 460.2 MB | ##5 | 26% 2025-05-07T20:26:10.9496430Z 2025-05-07T20:26:10.9496437Z 2025-05-07T20:26:10.9496442Z 2025-05-07T20:26:10.9508411Z libcusolver-11.7.2.5 | 156.9 MB | ######7 | 68%  2025-05-07T20:26:10.9509327Z 2025-05-07T20:26:11.0260335Z nsight-compute-2025. | 320.6 MB | ###6 | 36%  2025-05-07T20:26:11.0260608Z 2025-05-07T20:26:11.0260612Z 2025-05-07T20:26:11.0260616Z 2025-05-07T20:26:11.0260620Z 2025-05-07T20:26:11.0326764Z libcufft-11.3.3.41 | 147.4 MB | ######9 | 69%  2025-05-07T20:26:11.0632093Z libcublas-12.8.3.14 | 460.2 MB | ##6 | 26% 2025-05-07T20:26:11.0632538Z 2025-05-07T20:26:11.0632572Z 2025-05-07T20:26:11.0640638Z libcusparse-12.5.7.5 | 164.9 MB | #######6 | 77%  2025-05-07T20:26:11.0640908Z 2025-05-07T20:26:11.0640913Z 2025-05-07T20:26:11.0643585Z 2025-05-07T20:26:11.0872470Z libcusolver-11.7.2.5 | 156.9 MB | ####### | 70%  2025-05-07T20:26:11.0872745Z 2025-05-07T20:26:11.1330661Z nsight-compute-2025. | 320.6 MB | ###7 | 38%  2025-05-07T20:26:11.1340859Z libcublas-12.8.3.14 | 460.2 MB | ##7 | 27% 2025-05-07T20:26:11.1341116Z 2025-05-07T20:26:11.1341122Z 2025-05-07T20:26:11.1341150Z 2025-05-07T20:26:11.1342871Z 2025-05-07T20:26:11.1633715Z libcufft-11.3.3.41 | 147.4 MB | #######1 | 71%  2025-05-07T20:26:11.1634004Z 2025-05-07T20:26:11.1634008Z 2025-05-07T20:26:11.1640799Z libcusparse-12.5.7.5 | 164.9 MB | #######9 | 79%  2025-05-07T20:26:11.1641143Z 2025-05-07T20:26:11.1641149Z 2025-05-07T20:26:11.1641917Z 2025-05-07T20:26:11.2371660Z libcusolver-11.7.2.5 | 156.9 MB | #######2 | 73%  2025-05-07T20:26:11.2372077Z 2025-05-07T20:26:11.2372085Z 2025-05-07T20:26:11.2372090Z 2025-05-07T20:26:11.2372096Z 2025-05-07T20:26:11.2374418Z libcufft-11.3.3.41 | 147.4 MB | #######3 | 74%  2025-05-07T20:26:11.2655090Z libcublas-12.8.3.14 | 460.2 MB | ##8 | 28% 2025-05-07T20:26:11.2655353Z 2025-05-07T20:26:11.2699746Z nsight-compute-2025. 
| 320.6 MB | ###8 | 39%  2025-05-07T20:26:11.2700082Z 2025-05-07T20:26:11.2700086Z 2025-05-07T20:26:11.2820400Z libcusparse-12.5.7.5 | 164.9 MB | ########1 | 81%  2025-05-07T20:26:11.2820783Z 2025-05-07T20:26:11.2820787Z 2025-05-07T20:26:11.2821408Z 2025-05-07T20:26:11.3374335Z libcusolver-11.7.2.5 | 156.9 MB | #######5 | 75%  2025-05-07T20:26:11.3374779Z 2025-05-07T20:26:11.3374786Z 2025-05-07T20:26:11.3374792Z 2025-05-07T20:26:11.3375307Z 2025-05-07T20:26:11.3469392Z libcufft-11.3.3.41 | 147.4 MB | #######5 | 76%  2025-05-07T20:26:11.3661197Z libcublas-12.8.3.14 | 460.2 MB | ##8 | 29% 2025-05-07T20:26:11.3661497Z 2025-05-07T20:26:11.3702326Z nsight-compute-2025. | 320.6 MB | ###9 | 40%  2025-05-07T20:26:11.3702604Z 2025-05-07T20:26:11.3702814Z 2025-05-07T20:26:11.4376226Z libcusparse-12.5.7.5 | 164.9 MB | ########3 | 84%  2025-05-07T20:26:11.4376593Z 2025-05-07T20:26:11.4376598Z 2025-05-07T20:26:11.4376625Z 2025-05-07T20:26:11.4376629Z 2025-05-07T20:26:11.4471900Z libcufft-11.3.3.41 | 147.4 MB | #######8 | 78%  2025-05-07T20:26:11.4496117Z libcublas-12.8.3.14 | 460.2 MB | ##9 | 30% 2025-05-07T20:26:11.4496461Z 2025-05-07T20:26:11.4496468Z 2025-05-07T20:26:11.4498449Z 2025-05-07T20:26:11.4663360Z libcusolver-11.7.2.5 | 156.9 MB | #######7 | 77%  2025-05-07T20:26:11.4667675Z 2025-05-07T20:26:11.4704437Z nsight-compute-2025. | 320.6 MB | ####1 | 41%  2025-05-07T20:26:11.4704704Z 2025-05-07T20:26:11.4706580Z 2025-05-07T20:26:11.5377233Z libcusparse-12.5.7.5 | 164.9 MB | ########6 | 87%  2025-05-07T20:26:11.5377660Z 2025-05-07T20:26:11.5377678Z 2025-05-07T20:26:11.5377684Z 2025-05-07T20:26:11.5377961Z 2025-05-07T20:26:11.5498305Z libcufft-11.3.3.41 | 147.4 MB | ######## | 81%  2025-05-07T20:26:11.5498629Z 2025-05-07T20:26:11.5498644Z 2025-05-07T20:26:11.5499239Z 2025-05-07T20:26:11.5557096Z libcusolver-11.7.2.5 | 156.9 MB | #######9 | 79%  2025-05-07T20:26:11.5663674Z libcublas-12.8.3.14 | 460.2 MB | ### | 31% 2025-05-07T20:26:11.5665272Z 2025-05-07T20:26:11.5758945Z nsight-compute-2025. | 320.6 MB | ####2 | 42%  2025-05-07T20:26:11.5759203Z 2025-05-07T20:26:11.5759207Z 2025-05-07T20:26:11.6379891Z libcusparse-12.5.7.5 | 164.9 MB | ########8 | 89%  2025-05-07T20:26:11.6380204Z 2025-05-07T20:26:11.6380209Z 2025-05-07T20:26:11.6380214Z 2025-05-07T20:26:11.6380576Z 2025-05-07T20:26:11.6501067Z libcufft-11.3.3.41 | 147.4 MB | ########3 | 83%  2025-05-07T20:26:11.6501483Z 2025-05-07T20:26:11.6501517Z 2025-05-07T20:26:11.6502453Z 2025-05-07T20:26:11.6559601Z libcusolver-11.7.2.5 | 156.9 MB | ########1 | 82%  2025-05-07T20:26:11.6686678Z libcublas-12.8.3.14 | 460.2 MB | ###1 | 31% 2025-05-07T20:26:11.6687991Z 2025-05-07T20:26:11.6833065Z nsight-compute-2025. | 320.6 MB | ####3 | 44%  2025-05-07T20:26:11.6833464Z 2025-05-07T20:26:11.6834388Z 2025-05-07T20:26:11.7380366Z libcusparse-12.5.7.5 | 164.9 MB | #########1 | 91%  2025-05-07T20:26:11.7380653Z 2025-05-07T20:26:11.7380661Z 2025-05-07T20:26:11.7380666Z 2025-05-07T20:26:11.7380671Z 2025-05-07T20:26:11.7503325Z libcufft-11.3.3.41 | 147.4 MB | ########5 | 86%  2025-05-07T20:26:11.7503611Z 2025-05-07T20:26:11.7503615Z 2025-05-07T20:26:11.7504284Z 2025-05-07T20:26:11.7563478Z libcusolver-11.7.2.5 | 156.9 MB | ########3 | 84%  2025-05-07T20:26:11.7717161Z libcublas-12.8.3.14 | 460.2 MB | ###2 | 32% 2025-05-07T20:26:11.7719674Z 2025-05-07T20:26:11.7877908Z nsight-compute-2025. 
| 320.6 MB | ####4 | 45%  2025-05-07T20:26:11.7878210Z 2025-05-07T20:26:11.7879840Z 2025-05-07T20:26:11.8381078Z libcusparse-12.5.7.5 | 164.9 MB | #########3 | 94%  2025-05-07T20:26:11.8381369Z 2025-05-07T20:26:11.8381384Z 2025-05-07T20:26:11.8381388Z 2025-05-07T20:26:11.8382665Z 2025-05-07T20:26:11.8509501Z libcufft-11.3.3.41 | 147.4 MB | ########8 | 88%  2025-05-07T20:26:11.8510045Z 2025-05-07T20:26:11.8510056Z 2025-05-07T20:26:11.8510767Z 2025-05-07T20:26:11.8718147Z libcusolver-11.7.2.5 | 156.9 MB | ########6 | 86%  2025-05-07T20:26:11.8722236Z libcublas-12.8.3.14 | 460.2 MB | ###3 | 33% 2025-05-07T20:26:11.8723924Z 2025-05-07T20:26:11.8922715Z nsight-compute-2025. | 320.6 MB | ####5 | 46%  2025-05-07T20:26:11.8923122Z 2025-05-07T20:26:11.8923931Z 2025-05-07T20:26:11.9390351Z libcusparse-12.5.7.5 | 164.9 MB | #########5 | 96%  2025-05-07T20:26:11.9390748Z 2025-05-07T20:26:11.9390754Z 2025-05-07T20:26:11.9391022Z 2025-05-07T20:26:11.9392931Z 2025-05-07T20:26:11.9510889Z libcufft-11.3.3.41 | 147.4 MB | ######### | 91%  2025-05-07T20:26:11.9511255Z 2025-05-07T20:26:11.9511259Z 2025-05-07T20:26:11.9511953Z 2025-05-07T20:26:11.9747575Z libcusolver-11.7.2.5 | 156.9 MB | ########8 | 88%  2025-05-07T20:26:11.9802558Z libcublas-12.8.3.14 | 460.2 MB | ###3 | 34% 2025-05-07T20:26:11.9803081Z 2025-05-07T20:26:11.9924560Z nsight-compute-2025. | 320.6 MB | ####6 | 47%  2025-05-07T20:26:11.9924841Z 2025-05-07T20:26:11.9925516Z 2025-05-07T20:26:12.0394447Z libcusparse-12.5.7.5 | 164.9 MB | #########8 | 98%  2025-05-07T20:26:12.0394729Z 2025-05-07T20:26:12.0394736Z 2025-05-07T20:26:12.0394742Z 2025-05-07T20:26:12.0395311Z 2025-05-07T20:26:12.0512103Z libcufft-11.3.3.41 | 147.4 MB | #########3 | 93%  2025-05-07T20:26:12.0512372Z 2025-05-07T20:26:12.0512378Z 2025-05-07T20:26:12.0512803Z 2025-05-07T20:26:12.0754052Z libcusolver-11.7.2.5 | 156.9 MB | ######### | 91%  2025-05-07T20:26:12.0803647Z libcublas-12.8.3.14 | 460.2 MB | ###4 | 35% 2025-05-07T20:26:12.0805250Z 2025-05-07T20:26:12.1512752Z nsight-compute-2025. | 320.6 MB | ####8 | 48%  2025-05-07T20:26:12.1513084Z 2025-05-07T20:26:12.1513089Z 2025-05-07T20:26:12.1514581Z 2025-05-07T20:26:12.1754377Z libcusolver-11.7.2.5 | 156.9 MB | #########3 | 94%  2025-05-07T20:26:12.1847370Z libcublas-12.8.3.14 | 460.2 MB | ###5 | 36% 2025-05-07T20:26:12.1849375Z 2025-05-07T20:26:12.2069853Z nsight-compute-2025. | 320.6 MB | ####9 | 49%  2025-05-07T20:26:12.2070226Z 2025-05-07T20:26:12.2070232Z 2025-05-07T20:26:12.2070238Z 2025-05-07T20:26:12.2070244Z 2025-05-07T20:26:12.2513069Z libcufft-11.3.3.41 | 147.4 MB | #########5 | 96%  2025-05-07T20:26:12.2513348Z 2025-05-07T20:26:12.2513354Z 2025-05-07T20:26:12.2514018Z 2025-05-07T20:26:12.2848369Z libcusolver-11.7.2.5 | 156.9 MB | #########6 | 97%  2025-05-07T20:26:12.2848688Z 2025-05-07T20:26:12.3128913Z nsight-compute-2025. | 320.6 MB | ##### | 51%  2025-05-07T20:26:12.3129183Z 2025-05-07T20:26:12.3129190Z 2025-05-07T20:26:12.3129195Z 2025-05-07T20:26:12.3129199Z 2025-05-07T20:26:12.3513183Z libcufft-11.3.3.41 | 147.4 MB | #########8 | 98%  2025-05-07T20:26:12.3513539Z 2025-05-07T20:26:12.3513576Z 2025-05-07T20:26:12.3513585Z 2025-05-07T20:26:12.3662204Z libcusolver-11.7.2.5 | 156.9 MB | #########9 | 99%  2025-05-07T20:26:12.3912671Z libcublas-12.8.3.14 | 460.2 MB | ###6 | 37% 2025-05-07T20:26:12.3914539Z 2025-05-07T20:26:12.4818546Z nsight-compute-2025. 
| 320.6 MB | #####2 | 52%  2025-05-07T20:26:12.4913559Z libcublas-12.8.3.14 | 460.2 MB | ###7 | 37% 2025-05-07T20:26:12.4915381Z 2025-05-07T20:26:12.5820350Z nsight-compute-2025. | 320.6 MB | #####3 | 54%  2025-05-07T20:26:12.5913672Z libcublas-12.8.3.14 | 460.2 MB | ###8 | 39% 2025-05-07T20:26:12.5914567Z 2025-05-07T20:26:12.6914403Z nsight-compute-2025. | 320.6 MB | #####5 | 55%  2025-05-07T20:26:12.6915247Z 2025-05-07T20:26:12.7555649Z nsight-compute-2025. | 320.6 MB | #####7 | 58%  2025-05-07T20:26:12.7914401Z libcublas-12.8.3.14 | 460.2 MB | ###9 | 39% 2025-05-07T20:26:12.7916258Z 2025-05-07T20:26:12.8558650Z nsight-compute-2025. | 320.6 MB | #####9 | 60%  2025-05-07T20:26:12.9013916Z libcublas-12.8.3.14 | 460.2 MB | #### | 41% 2025-05-07T20:26:12.9014601Z 2025-05-07T20:26:12.9614990Z nsight-compute-2025. | 320.6 MB | ######1 | 61%  2025-05-07T20:26:13.0017151Z libcublas-12.8.3.14 | 460.2 MB | ####1 | 42% 2025-05-07T20:26:13.0017650Z 2025-05-07T20:26:13.0615277Z nsight-compute-2025. | 320.6 MB | ######3 | 63%  2025-05-07T20:26:13.1161517Z libcublas-12.8.3.14 | 460.2 MB | ####2 | 43% 2025-05-07T20:26:13.1163327Z 2025-05-07T20:26:13.1617552Z nsight-compute-2025. | 320.6 MB | ######5 | 65%  2025-05-07T20:26:13.2215488Z libcublas-12.8.3.14 | 460.2 MB | ####3 | 44% 2025-05-07T20:26:13.2216824Z 2025-05-07T20:26:13.2620308Z nsight-compute-2025. | 320.6 MB | ######6 | 67%  2025-05-07T20:26:13.3278701Z libcublas-12.8.3.14 | 460.2 MB | ####4 | 45% 2025-05-07T20:26:13.3278995Z 2025-05-07T20:26:13.3621476Z nsight-compute-2025. | 320.6 MB | ######8 | 68%  2025-05-07T20:26:13.4279743Z libcublas-12.8.3.14 | 460.2 MB | ####5 | 46% 2025-05-07T20:26:13.4281109Z 2025-05-07T20:26:13.4981802Z nsight-compute-2025. | 320.6 MB | ####### | 70%  2025-05-07T20:26:13.5282293Z libcublas-12.8.3.14 | 460.2 MB | ####7 | 47% 2025-05-07T20:26:13.5284457Z 2025-05-07T20:26:13.5984422Z nsight-compute-2025. | 320.6 MB | #######1 | 72%  2025-05-07T20:26:13.6391944Z libcublas-12.8.3.14 | 460.2 MB | ####8 | 48% 2025-05-07T20:26:13.6394956Z 2025-05-07T20:26:13.7393033Z nsight-compute-2025. | 320.6 MB | #######3 | 74%  2025-05-07T20:26:13.7393807Z 2025-05-07T20:26:13.7606411Z nsight-compute-2025. | 320.6 MB | #######6 | 76%  2025-05-07T20:26:13.8608895Z libcublas-12.8.3.14 | 460.2 MB | ####9 | 49% 2025-05-07T20:26:13.8617799Z libcublas-12.8.3.14 | 460.2 MB | ##### | 50% 2025-05-07T20:26:13.8619492Z 2025-05-07T20:26:13.9609375Z nsight-compute-2025. | 320.6 MB | #######7 | 78%  2025-05-07T20:26:13.9785371Z libcublas-12.8.3.14 | 460.2 MB | #####1 | 51% 2025-05-07T20:26:13.9785663Z 2025-05-07T20:26:13.9964221Z nsight-compute-2025. | 320.6 MB | #######9 | 80%  2025-05-07T20:26:13.9964538Z 2025-05-07T20:26:13.9969224Z 2025-05-07T20:26:14.0421756Z libcusparse-12.5.7.5 | 164.9 MB | ########## | 100%  2025-05-07T20:26:14.0422047Z 2025-05-07T20:26:14.0422052Z 2025-05-07T20:26:14.0422056Z 2025-05-07T20:26:14.0422060Z 2025-05-07T20:26:14.0425525Z 2025-05-07T20:26:14.0612158Z libnpp-12.3.3.65 | 130.6 MB | | 0%  2025-05-07T20:26:14.1193351Z libcublas-12.8.3.14 | 460.2 MB | #####2 | 52% 2025-05-07T20:26:14.1196354Z 2025-05-07T20:26:14.1427094Z nsight-compute-2025. 
| 320.6 MB | ########1 | 81%  2025-05-07T20:26:14.1427363Z 2025-05-07T20:26:14.1427368Z 2025-05-07T20:26:14.1427372Z 2025-05-07T20:26:14.1427383Z 2025-05-07T20:26:14.1427387Z 2025-05-07T20:26:14.1877222Z libnpp-12.3.3.65 | 130.6 MB | 2 | 3%  2025-05-07T20:26:14.2428203Z libcublas-12.8.3.14 | 460.2 MB | #####3 | 53% 2025-05-07T20:26:14.2428463Z 2025-05-07T20:26:14.2428467Z 2025-05-07T20:26:14.2428471Z 2025-05-07T20:26:14.2428475Z 2025-05-07T20:26:14.2431754Z 2025-05-07T20:26:14.2655753Z libnpp-12.3.3.65 | 130.6 MB | 5 | 6%  2025-05-07T20:26:14.2656095Z 2025-05-07T20:26:14.3010377Z nsight-compute-2025. | 320.6 MB | ########2 | 83%  2025-05-07T20:26:14.3429314Z libcublas-12.8.3.14 | 460.2 MB | #####4 | 54% 2025-05-07T20:26:14.3429664Z 2025-05-07T20:26:14.3429672Z 2025-05-07T20:26:14.3429678Z 2025-05-07T20:26:14.3429683Z 2025-05-07T20:26:14.3429718Z 2025-05-07T20:26:14.4000194Z libnpp-12.3.3.65 | 130.6 MB | 8 | 9%  2025-05-07T20:26:14.4001075Z 2025-05-07T20:26:14.4237407Z nsight-compute-2025. | 320.6 MB | ########4 | 84%  2025-05-07T20:26:14.4430177Z libcublas-12.8.3.14 | 460.2 MB | #####5 | 55% 2025-05-07T20:26:14.4430498Z 2025-05-07T20:26:14.4430787Z 2025-05-07T20:26:14.4430792Z 2025-05-07T20:26:14.4430797Z 2025-05-07T20:26:14.4432697Z 2025-05-07T20:26:14.4879446Z libnpp-12.3.3.65 | 130.6 MB | #1 | 12%  2025-05-07T20:26:14.4879735Z 2025-05-07T20:26:14.4879740Z 2025-05-07T20:26:14.4879744Z 2025-05-07T20:26:14.4883466Z 2025-05-07T20:26:14.5282151Z libcufft-11.3.3.41 | 147.4 MB | ########## | 100%  2025-05-07T20:26:14.5282437Z 2025-05-07T20:26:14.5416028Z nsight-compute-2025. | 320.6 MB | ########5 | 86%  2025-05-07T20:26:14.5434289Z libcublas-12.8.3.14 | 460.2 MB | #####5 | 56% 2025-05-07T20:26:14.5434546Z 2025-05-07T20:26:14.5434790Z 2025-05-07T20:26:14.5434796Z 2025-05-07T20:26:14.5434800Z 2025-05-07T20:26:14.5438448Z 2025-05-07T20:26:14.5579040Z libnpp-12.3.3.65 | 130.6 MB | #4 | 14%  2025-05-07T20:26:14.5579335Z 2025-05-07T20:26:14.5579340Z 2025-05-07T20:26:14.5579343Z 2025-05-07T20:26:14.5579347Z 2025-05-07T20:26:14.5579358Z 2025-05-07T20:26:14.5579375Z 2025-05-07T20:26:14.6581289Z cuda-nsight-12.8.55 | 113.2 MB | | 0%  2025-05-07T20:26:14.6581620Z 2025-05-07T20:26:14.6581624Z 2025-05-07T20:26:14.6581629Z 2025-05-07T20:26:14.6581643Z 2025-05-07T20:26:14.6581647Z 2025-05-07T20:26:14.6581651Z 2025-05-07T20:26:14.6594954Z cuda-nsight-12.8.55 | 113.2 MB | 2 | 3%  2025-05-07T20:26:14.6668803Z libcublas-12.8.3.14 | 460.2 MB | #####6 | 57% 2025-05-07T20:26:14.6669046Z 2025-05-07T20:26:14.6855423Z nsight-compute-2025. | 320.6 MB | ########6 | 87%  2025-05-07T20:26:14.6855686Z 2025-05-07T20:26:14.6855898Z 2025-05-07T20:26:14.6855929Z 2025-05-07T20:26:14.6855936Z 2025-05-07T20:26:14.6855963Z 2025-05-07T20:26:14.7584526Z libnpp-12.3.3.65 | 130.6 MB | #6 | 17%  2025-05-07T20:26:14.7584852Z 2025-05-07T20:26:14.7584856Z 2025-05-07T20:26:14.7584861Z 2025-05-07T20:26:14.7584866Z 2025-05-07T20:26:14.7584871Z 2025-05-07T20:26:14.7584885Z 2025-05-07T20:26:14.7883940Z cuda-nsight-12.8.55 | 113.2 MB | 5 | 5%  2025-05-07T20:26:14.8043223Z libcublas-12.8.3.14 | 460.2 MB | #####7 | 57% 2025-05-07T20:26:14.8045285Z 2025-05-07T20:26:14.8168849Z nsight-compute-2025. 
| 320.6 MB | ########7 | 88%  2025-05-07T20:26:14.8169123Z 2025-05-07T20:26:14.8169129Z 2025-05-07T20:26:14.8169133Z 2025-05-07T20:26:14.8169138Z 2025-05-07T20:26:14.8169146Z 2025-05-07T20:26:14.8584257Z libnpp-12.3.3.65 | 130.6 MB | #9 | 19%  2025-05-07T20:26:14.8584539Z 2025-05-07T20:26:14.8584544Z 2025-05-07T20:26:14.8584548Z 2025-05-07T20:26:14.8584569Z 2025-05-07T20:26:14.8584573Z 2025-05-07T20:26:14.8584863Z 2025-05-07T20:26:14.8975485Z cuda-nsight-12.8.55 | 113.2 MB | 7 | 8%  2025-05-07T20:26:14.9297078Z libcublas-12.8.3.14 | 460.2 MB | #####8 | 58% 2025-05-07T20:26:14.9297354Z 2025-05-07T20:26:14.9297360Z 2025-05-07T20:26:14.9297365Z 2025-05-07T20:26:14.9297400Z 2025-05-07T20:26:14.9299135Z 2025-05-07T20:26:14.9387146Z libnpp-12.3.3.65 | 130.6 MB | ##1 | 21%  2025-05-07T20:26:14.9392044Z 2025-05-07T20:26:14.9588595Z nsight-compute-2025. | 320.6 MB | ########9 | 89%  2025-05-07T20:26:14.9588867Z 2025-05-07T20:26:14.9588873Z 2025-05-07T20:26:14.9588887Z 2025-05-07T20:26:14.9588894Z 2025-05-07T20:26:14.9588898Z 2025-05-07T20:26:14.9588902Z 2025-05-07T20:26:15.0154432Z cuda-nsight-12.8.55 | 113.2 MB | # | 10%  2025-05-07T20:26:15.0415485Z libcublas-12.8.3.14 | 460.2 MB | #####8 | 59% 2025-05-07T20:26:15.0415749Z 2025-05-07T20:26:15.0416109Z 2025-05-07T20:26:15.0416122Z 2025-05-07T20:26:15.0416129Z 2025-05-07T20:26:15.0416135Z 2025-05-07T20:26:15.0592891Z libnpp-12.3.3.65 | 130.6 MB | ##3 | 24%  2025-05-07T20:26:15.0593209Z 2025-05-07T20:26:15.0593215Z 2025-05-07T20:26:15.0593219Z 2025-05-07T20:26:15.0593223Z 2025-05-07T20:26:15.0593229Z 2025-05-07T20:26:15.0593970Z 2025-05-07T20:26:15.0638792Z cuda-nsight-12.8.55 | 113.2 MB | #2 | 13%  2025-05-07T20:26:15.0640879Z 2025-05-07T20:26:15.1172272Z nsight-compute-2025. | 320.6 MB | ########9 | 90%  2025-05-07T20:26:15.1489499Z libcublas-12.8.3.14 | 460.2 MB | #####9 | 60% 2025-05-07T20:26:15.1489856Z 2025-05-07T20:26:15.1490036Z 2025-05-07T20:26:15.1490042Z 2025-05-07T20:26:15.1490151Z 2025-05-07T20:26:15.1491675Z 2025-05-07T20:26:15.1600465Z libnpp-12.3.3.65 | 130.6 MB | ##5 | 26%  2025-05-07T20:26:15.1600840Z 2025-05-07T20:26:15.1600844Z 2025-05-07T20:26:15.1601098Z 2025-05-07T20:26:15.1601120Z 2025-05-07T20:26:15.1601124Z 2025-05-07T20:26:15.1603397Z 2025-05-07T20:26:15.1656268Z cuda-nsight-12.8.55 | 113.2 MB | #5 | 16%  2025-05-07T20:26:15.1657899Z 2025-05-07T20:26:15.2229206Z nsight-compute-2025. | 320.6 MB | ######### | 91%  2025-05-07T20:26:15.2494825Z libcublas-12.8.3.14 | 460.2 MB | ###### | 60% 2025-05-07T20:26:15.2495610Z 2025-05-07T20:26:15.2495617Z 2025-05-07T20:26:15.2495623Z 2025-05-07T20:26:15.2495628Z 2025-05-07T20:26:15.2497716Z 2025-05-07T20:26:15.2600257Z libnpp-12.3.3.65 | 130.6 MB | ##8 | 28%  2025-05-07T20:26:15.2600640Z 2025-05-07T20:26:15.2600646Z 2025-05-07T20:26:15.2600651Z 2025-05-07T20:26:15.2600656Z 2025-05-07T20:26:15.2600661Z 2025-05-07T20:26:15.2604536Z 2025-05-07T20:26:15.2735070Z cuda-nsight-12.8.55 | 113.2 MB | #8 | 18%  2025-05-07T20:26:15.2737938Z 2025-05-07T20:26:15.3321843Z nsight-compute-2025. 
| 320.6 MB | #########1 | 92%  2025-05-07T20:26:15.3516307Z libcublas-12.8.3.14 | 460.2 MB | ###### | 61% 2025-05-07T20:26:15.3516639Z 2025-05-07T20:26:15.3516645Z 2025-05-07T20:26:15.3516651Z 2025-05-07T20:26:15.3516656Z 2025-05-07T20:26:15.3522838Z 2025-05-07T20:26:15.3603154Z libnpp-12.3.3.65 | 130.6 MB | ### | 30%  2025-05-07T20:26:15.3603543Z 2025-05-07T20:26:15.3603549Z 2025-05-07T20:26:15.3603555Z 2025-05-07T20:26:15.3605296Z 2025-05-07T20:26:15.3605302Z 2025-05-07T20:26:15.3605307Z 2025-05-07T20:26:15.3737992Z cuda-nsight-12.8.55 | 113.2 MB | ##1 | 21%  2025-05-07T20:26:15.3739499Z 2025-05-07T20:26:15.4326830Z nsight-compute-2025. | 320.6 MB | #########2 | 93%  2025-05-07T20:26:15.4650816Z libcublas-12.8.3.14 | 460.2 MB | ######1 | 62% 2025-05-07T20:26:15.4651197Z 2025-05-07T20:26:15.4651204Z 2025-05-07T20:26:15.4651209Z 2025-05-07T20:26:15.4651215Z 2025-05-07T20:26:15.4651220Z 2025-05-07T20:26:15.4708781Z libnpp-12.3.3.65 | 130.6 MB | ###2 | 32%  2025-05-07T20:26:15.4709167Z 2025-05-07T20:26:15.4709174Z 2025-05-07T20:26:15.4709179Z 2025-05-07T20:26:15.4709185Z 2025-05-07T20:26:15.4709192Z 2025-05-07T20:26:15.4711048Z 2025-05-07T20:26:15.4742124Z cuda-nsight-12.8.55 | 113.2 MB | ##3 | 24%  2025-05-07T20:26:15.4742525Z 2025-05-07T20:26:15.5327998Z nsight-compute-2025. | 320.6 MB | #########3 | 94%  2025-05-07T20:26:15.5655373Z libcublas-12.8.3.14 | 460.2 MB | ######2 | 62% 2025-05-07T20:26:15.5655756Z 2025-05-07T20:26:15.5655762Z 2025-05-07T20:26:15.5655768Z 2025-05-07T20:26:15.5655773Z 2025-05-07T20:26:15.5655779Z 2025-05-07T20:26:15.5751411Z libnpp-12.3.3.65 | 130.6 MB | ###4 | 34%  2025-05-07T20:26:15.5751813Z 2025-05-07T20:26:15.5751819Z 2025-05-07T20:26:15.5751825Z 2025-05-07T20:26:15.5751830Z 2025-05-07T20:26:15.5751835Z 2025-05-07T20:26:15.5754163Z 2025-05-07T20:26:15.5851250Z cuda-nsight-12.8.55 | 113.2 MB | ##6 | 26%  2025-05-07T20:26:15.5851657Z 2025-05-07T20:26:15.6330368Z nsight-compute-2025. | 320.6 MB | #########4 | 95%  2025-05-07T20:26:15.6711487Z libcublas-12.8.3.14 | 460.2 MB | ######2 | 63% 2025-05-07T20:26:15.6711842Z 2025-05-07T20:26:15.6711848Z 2025-05-07T20:26:15.6711854Z 2025-05-07T20:26:15.6711859Z 2025-05-07T20:26:15.6714923Z 2025-05-07T20:26:15.6751678Z libnpp-12.3.3.65 | 130.6 MB | ###6 | 36%  2025-05-07T20:26:15.6752060Z 2025-05-07T20:26:15.6752066Z 2025-05-07T20:26:15.6752071Z 2025-05-07T20:26:15.6752077Z 2025-05-07T20:26:15.6752082Z 2025-05-07T20:26:15.6752087Z 2025-05-07T20:26:15.6949284Z cuda-nsight-12.8.55 | 113.2 MB | ##8 | 29%  2025-05-07T20:26:15.6952737Z 2025-05-07T20:26:15.7357201Z nsight-compute-2025. | 320.6 MB | #########5 | 95%  2025-05-07T20:26:15.7714750Z libcublas-12.8.3.14 | 460.2 MB | ######3 | 64% 2025-05-07T20:26:15.7715125Z 2025-05-07T20:26:15.7715131Z 2025-05-07T20:26:15.7715787Z 2025-05-07T20:26:15.7715795Z 2025-05-07T20:26:15.7717001Z 2025-05-07T20:26:15.7751567Z libnpp-12.3.3.65 | 130.6 MB | ###8 | 38%  2025-05-07T20:26:15.7751976Z 2025-05-07T20:26:15.7751982Z 2025-05-07T20:26:15.7751987Z 2025-05-07T20:26:15.7751992Z 2025-05-07T20:26:15.7751997Z 2025-05-07T20:26:15.7752175Z 2025-05-07T20:26:15.8037784Z cuda-nsight-12.8.55 | 113.2 MB | ###1 | 32%  2025-05-07T20:26:15.8043632Z 2025-05-07T20:26:15.8466662Z nsight-compute-2025. 
| 320.6 MB | #########6 | 96%  2025-05-07T20:26:15.8723307Z libcublas-12.8.3.14 | 460.2 MB | ######4 | 64% 2025-05-07T20:26:15.8723635Z 2025-05-07T20:26:15.8723641Z 2025-05-07T20:26:15.8723647Z 2025-05-07T20:26:15.8723652Z 2025-05-07T20:26:15.8727132Z 2025-05-07T20:26:15.8752671Z libnpp-12.3.3.65 | 130.6 MB | #### | 41%  2025-05-07T20:26:15.8753062Z 2025-05-07T20:26:15.8753068Z 2025-05-07T20:26:15.8753074Z 2025-05-07T20:26:15.8753103Z 2025-05-07T20:26:15.8753109Z 2025-05-07T20:26:15.8753114Z 2025-05-07T20:26:15.8847016Z cuda-nsight-12.8.55 | 113.2 MB | ###4 | 34%  2025-05-07T20:26:15.8847409Z 2025-05-07T20:26:15.8847415Z 2025-05-07T20:26:15.8847449Z 2025-05-07T20:26:15.9044338Z libcusolver-11.7.2.5 | 156.9 MB | ########## | 100%  2025-05-07T20:26:15.9045847Z 2025-05-07T20:26:15.9427853Z nsight-compute-2025. | 320.6 MB | #########7 | 97%  2025-05-07T20:26:15.9428243Z 2025-05-07T20:26:15.9428249Z 2025-05-07T20:26:15.9428254Z 2025-05-07T20:26:15.9428260Z 2025-05-07T20:26:15.9428266Z 2025-05-07T20:26:15.9428271Z 2025-05-07T20:26:15.9428278Z 2025-05-07T20:26:15.9574820Z cuda-nvvp-12.8.57 | 112.4 MB | | 0%  2025-05-07T20:26:15.9867521Z libcublas-12.8.3.14 | 460.2 MB | ######4 | 65% 2025-05-07T20:26:15.9867965Z 2025-05-07T20:26:15.9867971Z 2025-05-07T20:26:15.9867976Z 2025-05-07T20:26:15.9867981Z 2025-05-07T20:26:15.9875207Z 2025-05-07T20:26:16.0052065Z libnpp-12.3.3.65 | 130.6 MB | ####2 | 43%  2025-05-07T20:26:16.0052453Z 2025-05-07T20:26:16.0052460Z 2025-05-07T20:26:16.0052465Z 2025-05-07T20:26:16.0052470Z 2025-05-07T20:26:16.0052475Z 2025-05-07T20:26:16.0052481Z 2025-05-07T20:26:16.0138692Z cuda-nsight-12.8.55 | 113.2 MB | ###7 | 37%  2025-05-07T20:26:16.0141532Z 2025-05-07T20:26:16.0431681Z nsight-compute-2025. | 320.6 MB | #########8 | 98%  2025-05-07T20:26:16.0432069Z 2025-05-07T20:26:16.0432075Z 2025-05-07T20:26:16.0432081Z 2025-05-07T20:26:16.0432086Z 2025-05-07T20:26:16.0432091Z 2025-05-07T20:26:16.0432096Z 2025-05-07T20:26:16.0432854Z 2025-05-07T20:26:16.0743460Z cuda-nvvp-12.8.57 | 112.4 MB | 2 | 2%  2025-05-07T20:26:16.1030174Z libcublas-12.8.3.14 | 460.2 MB | ######5 | 65% 2025-05-07T20:26:16.1030540Z 2025-05-07T20:26:16.1030548Z 2025-05-07T20:26:16.1030554Z 2025-05-07T20:26:16.1030559Z 2025-05-07T20:26:16.1034407Z 2025-05-07T20:26:16.1105491Z libnpp-12.3.3.65 | 130.6 MB | ####4 | 45%  2025-05-07T20:26:16.1105885Z 2025-05-07T20:26:16.1105891Z 2025-05-07T20:26:16.1105896Z 2025-05-07T20:26:16.1105901Z 2025-05-07T20:26:16.1105906Z 2025-05-07T20:26:16.1108939Z 2025-05-07T20:26:16.1299068Z cuda-nsight-12.8.55 | 113.2 MB | ###9 | 40%  2025-05-07T20:26:16.1308883Z 2025-05-07T20:26:16.1437494Z nsight-compute-2025. 
| 320.6 MB | #########8 | 99%  2025-05-07T20:26:16.1437851Z 2025-05-07T20:26:16.1437856Z 2025-05-07T20:26:16.1437859Z 2025-05-07T20:26:16.1437863Z 2025-05-07T20:26:16.1437867Z 2025-05-07T20:26:16.1437870Z 2025-05-07T20:26:16.1439159Z 2025-05-07T20:26:16.1759816Z cuda-nvvp-12.8.57 | 112.4 MB | 4 | 4%  2025-05-07T20:26:16.2159467Z libcublas-12.8.3.14 | 460.2 MB | ######6 | 66% 2025-05-07T20:26:16.2159828Z 2025-05-07T20:26:16.2159832Z 2025-05-07T20:26:16.2159836Z 2025-05-07T20:26:16.2159840Z 2025-05-07T20:26:16.2163238Z 2025-05-07T20:26:16.2267424Z libnpp-12.3.3.65 | 130.6 MB | ####6 | 47%  2025-05-07T20:26:16.2268271Z 2025-05-07T20:26:16.2268277Z 2025-05-07T20:26:16.2268282Z 2025-05-07T20:26:16.2268288Z 2025-05-07T20:26:16.2268293Z 2025-05-07T20:26:16.2268299Z 2025-05-07T20:26:16.2389772Z cuda-nsight-12.8.55 | 113.2 MB | ####2 | 42%  2025-05-07T20:26:16.2390204Z 2025-05-07T20:26:16.2442135Z nsight-compute-2025. | 320.6 MB | #########9 | 100%  2025-05-07T20:26:16.2442504Z 2025-05-07T20:26:16.2442510Z 2025-05-07T20:26:16.2442525Z 2025-05-07T20:26:16.2442530Z 2025-05-07T20:26:16.2442535Z 2025-05-07T20:26:16.2442541Z 2025-05-07T20:26:16.2442546Z 2025-05-07T20:26:16.2763052Z cuda-nvvp-12.8.57 | 112.4 MB | 6 | 6%  2025-05-07T20:26:16.3223359Z libcublas-12.8.3.14 | 460.2 MB | ######6 | 67% 2025-05-07T20:26:16.3223709Z 2025-05-07T20:26:16.3223715Z 2025-05-07T20:26:16.3223721Z 2025-05-07T20:26:16.3223762Z 2025-05-07T20:26:16.3226326Z 2025-05-07T20:26:16.3392059Z libnpp-12.3.3.65 | 130.6 MB | ####8 | 49%  2025-05-07T20:26:16.3392449Z 2025-05-07T20:26:16.3392455Z 2025-05-07T20:26:16.3392657Z 2025-05-07T20:26:16.3392662Z 2025-05-07T20:26:16.3392667Z 2025-05-07T20:26:16.3392673Z 2025-05-07T20:26:16.3449416Z cuda-nsight-12.8.55 | 113.2 MB | ####4 | 44%  2025-05-07T20:26:16.3449837Z 2025-05-07T20:26:16.3449842Z 2025-05-07T20:26:16.3449848Z 2025-05-07T20:26:16.3449853Z 2025-05-07T20:26:16.3449867Z 2025-05-07T20:26:16.3449872Z 2025-05-07T20:26:16.3449877Z 2025-05-07T20:26:16.3763635Z cuda-nvvp-12.8.57 | 112.4 MB | 8 | 9%  2025-05-07T20:26:16.4224872Z libcublas-12.8.3.14 | 460.2 MB | ######7 | 67% 2025-05-07T20:26:16.4225228Z 2025-05-07T20:26:16.4225241Z 2025-05-07T20:26:16.4225246Z 2025-05-07T20:26:16.4225252Z 2025-05-07T20:26:16.4227695Z 2025-05-07T20:26:16.4394367Z libnpp-12.3.3.65 | 130.6 MB | ##### | 50%  2025-05-07T20:26:16.4394773Z 2025-05-07T20:26:16.4394779Z 2025-05-07T20:26:16.4394784Z 2025-05-07T20:26:16.4394789Z 2025-05-07T20:26:16.4394794Z 2025-05-07T20:26:16.4394800Z 2025-05-07T20:26:16.4451170Z cuda-nsight-12.8.55 | 113.2 MB | ####6 | 47%  2025-05-07T20:26:16.4451588Z 2025-05-07T20:26:16.4451612Z 2025-05-07T20:26:16.4451618Z 2025-05-07T20:26:16.4451623Z 2025-05-07T20:26:16.4451629Z 2025-05-07T20:26:16.4451634Z 2025-05-07T20:26:16.4451639Z 2025-05-07T20:26:16.4864718Z cuda-nvvp-12.8.57 | 112.4 MB | #1 | 11%  2025-05-07T20:26:16.5225416Z libcublas-12.8.3.14 | 460.2 MB | ######7 | 68% 2025-05-07T20:26:16.5225778Z 2025-05-07T20:26:16.5225782Z 2025-05-07T20:26:16.5225786Z 2025-05-07T20:26:16.5225790Z 2025-05-07T20:26:16.5227272Z 2025-05-07T20:26:16.5451916Z libnpp-12.3.3.65 | 130.6 MB | #####2 | 53%  2025-05-07T20:26:16.5452286Z 2025-05-07T20:26:16.5452319Z 2025-05-07T20:26:16.5452323Z 2025-05-07T20:26:16.5452327Z 2025-05-07T20:26:16.5452332Z 2025-05-07T20:26:16.5452336Z 2025-05-07T20:26:16.5457006Z 2025-05-07T20:26:16.5875671Z cuda-nvvp-12.8.57 | 112.4 MB | #3 | 14%  2025-05-07T20:26:16.5913006Z libcublas-12.8.3.14 | 460.2 MB | ######8 | 69% 2025-05-07T20:26:16.5913698Z 2025-05-07T20:26:16.5913704Z 
2025-05-07T20:26:16.5913709Z 2025-05-07T20:26:16.5913715Z 2025-05-07T20:26:16.5913720Z 2025-05-07T20:26:16.5913726Z 2025-05-07T20:26:16.6228956Z cuda-nsight-12.8.55 | 113.2 MB | ####8 | 49%  2025-05-07T20:26:16.6229358Z 2025-05-07T20:26:16.6229363Z 2025-05-07T20:26:16.6229366Z 2025-05-07T20:26:16.6229370Z 2025-05-07T20:26:16.6231964Z 2025-05-07T20:26:16.6532062Z libnpp-12.3.3.65 | 130.6 MB | #####4 | 55%  2025-05-07T20:26:16.6532421Z 2025-05-07T20:26:16.6532425Z 2025-05-07T20:26:16.6532428Z 2025-05-07T20:26:16.6532432Z 2025-05-07T20:26:16.6532436Z 2025-05-07T20:26:16.6532725Z 2025-05-07T20:26:16.6535589Z 2025-05-07T20:26:16.6919132Z cuda-nvvp-12.8.57 | 112.4 MB | #6 | 16%  2025-05-07T20:26:16.6919439Z 2025-05-07T20:26:16.6919444Z 2025-05-07T20:26:16.6919448Z 2025-05-07T20:26:16.6919451Z 2025-05-07T20:26:16.6919455Z 2025-05-07T20:26:16.6919466Z 2025-05-07T20:26:16.6971184Z cuda-nsight-12.8.55 | 113.2 MB | #####1 | 51%  2025-05-07T20:26:16.7246170Z libcublas-12.8.3.14 | 460.2 MB | ######9 | 69% 2025-05-07T20:26:16.7246537Z 2025-05-07T20:26:16.7246543Z 2025-05-07T20:26:16.7246549Z 2025-05-07T20:26:16.7246554Z 2025-05-07T20:26:16.7250361Z 2025-05-07T20:26:16.7532998Z libnpp-12.3.3.65 | 130.6 MB | #####6 | 57%  2025-05-07T20:26:16.7533379Z 2025-05-07T20:26:16.7533383Z 2025-05-07T20:26:16.7533387Z 2025-05-07T20:26:16.7533390Z 2025-05-07T20:26:16.7533394Z 2025-05-07T20:26:16.7533398Z 2025-05-07T20:26:16.7533401Z 2025-05-07T20:26:16.7924541Z cuda-nvvp-12.8.57 | 112.4 MB | #8 | 18%  2025-05-07T20:26:16.7924955Z 2025-05-07T20:26:16.7924961Z 2025-05-07T20:26:16.7924964Z 2025-05-07T20:26:16.7924968Z 2025-05-07T20:26:16.7924973Z 2025-05-07T20:26:16.7924977Z 2025-05-07T20:26:16.8001294Z cuda-nsight-12.8.55 | 113.2 MB | #####3 | 53%  2025-05-07T20:26:16.8288739Z libcublas-12.8.3.14 | 460.2 MB | ######9 | 70% 2025-05-07T20:26:16.8288999Z 2025-05-07T20:26:16.8289097Z 2025-05-07T20:26:16.8289105Z 2025-05-07T20:26:16.8289127Z 2025-05-07T20:26:16.8293722Z 2025-05-07T20:26:16.8540540Z libnpp-12.3.3.65 | 130.6 MB | #####8 | 59%  2025-05-07T20:26:16.8540882Z 2025-05-07T20:26:16.8540888Z 2025-05-07T20:26:16.8540894Z 2025-05-07T20:26:16.8540910Z 2025-05-07T20:26:16.8540915Z 2025-05-07T20:26:16.8540920Z 2025-05-07T20:26:16.8540925Z 2025-05-07T20:26:16.8929481Z cuda-nvvp-12.8.57 | 112.4 MB | ## | 21%  2025-05-07T20:26:16.8929797Z 2025-05-07T20:26:16.8929830Z 2025-05-07T20:26:16.8929834Z 2025-05-07T20:26:16.8929838Z 2025-05-07T20:26:16.8929842Z 2025-05-07T20:26:16.8934965Z 2025-05-07T20:26:16.9008215Z cuda-nsight-12.8.55 | 113.2 MB | #####5 | 56%  2025-05-07T20:26:16.9290592Z libcublas-12.8.3.14 | 460.2 MB | ####### | 70% 2025-05-07T20:26:16.9290932Z 2025-05-07T20:26:16.9290963Z 2025-05-07T20:26:16.9290967Z 2025-05-07T20:26:16.9290971Z 2025-05-07T20:26:16.9295150Z 2025-05-07T20:26:16.9545108Z libnpp-12.3.3.65 | 130.6 MB | ###### | 61%  2025-05-07T20:26:16.9545514Z 2025-05-07T20:26:16.9545519Z 2025-05-07T20:26:16.9545523Z 2025-05-07T20:26:16.9545526Z 2025-05-07T20:26:16.9545530Z 2025-05-07T20:26:16.9545535Z 2025-05-07T20:26:16.9546257Z 2025-05-07T20:26:16.9929816Z cuda-nvvp-12.8.57 | 112.4 MB | ##3 | 23%  2025-05-07T20:26:16.9930275Z 2025-05-07T20:26:16.9930283Z 2025-05-07T20:26:16.9930289Z 2025-05-07T20:26:16.9930294Z 2025-05-07T20:26:16.9930332Z 2025-05-07T20:26:16.9934625Z 2025-05-07T20:26:17.0009897Z cuda-nsight-12.8.55 | 113.2 MB | #####8 | 58%  2025-05-07T20:26:17.0292278Z libcublas-12.8.3.14 | 460.2 MB | ####### | 71% 2025-05-07T20:26:17.0292609Z 2025-05-07T20:26:17.0292613Z 2025-05-07T20:26:17.0292617Z 
2025-05-07T20:26:17.0292621Z 2025-05-07T20:26:17.0294335Z 2025-05-07T20:26:17.0546574Z libnpp-12.3.3.65 | 130.6 MB | ######2 | 63%  2025-05-07T20:26:17.0546883Z 2025-05-07T20:26:17.0546888Z 2025-05-07T20:26:17.0546892Z 2025-05-07T20:26:17.0546895Z 2025-05-07T20:26:17.0546899Z 2025-05-07T20:26:17.0546903Z 2025-05-07T20:26:17.0548564Z 2025-05-07T20:26:17.1012431Z cuda-nvvp-12.8.57 | 112.4 MB | ##5 | 26%  2025-05-07T20:26:17.1012800Z 2025-05-07T20:26:17.1012805Z 2025-05-07T20:26:17.1012808Z 2025-05-07T20:26:17.1012812Z 2025-05-07T20:26:17.1012816Z 2025-05-07T20:26:17.1012827Z 2025-05-07T20:26:17.1211839Z cuda-nsight-12.8.55 | 113.2 MB | ###### | 61%  2025-05-07T20:26:17.1296610Z libcublas-12.8.3.14 | 460.2 MB | #######1 | 72% 2025-05-07T20:26:17.1296866Z 2025-05-07T20:26:17.1296878Z 2025-05-07T20:26:17.1296882Z 2025-05-07T20:26:17.1296886Z 2025-05-07T20:26:17.1300122Z 2025-05-07T20:26:17.1670332Z libnpp-12.3.3.65 | 130.6 MB | ######5 | 65%  2025-05-07T20:26:17.1670677Z 2025-05-07T20:26:17.1670682Z 2025-05-07T20:26:17.1670686Z 2025-05-07T20:26:17.1670690Z 2025-05-07T20:26:17.1670694Z 2025-05-07T20:26:17.1670697Z 2025-05-07T20:26:17.1670701Z 2025-05-07T20:26:17.2020302Z cuda-nvvp-12.8.57 | 112.4 MB | ##7 | 28%  2025-05-07T20:26:17.2020627Z 2025-05-07T20:26:17.2020631Z 2025-05-07T20:26:17.2020635Z 2025-05-07T20:26:17.2020639Z 2025-05-07T20:26:17.2020643Z 2025-05-07T20:26:17.2020647Z 2025-05-07T20:26:17.2297607Z cuda-nsight-12.8.55 | 113.2 MB | ######2 | 63%  2025-05-07T20:26:17.2297933Z 2025-05-07T20:26:17.2297967Z 2025-05-07T20:26:17.2297972Z 2025-05-07T20:26:17.2297976Z 2025-05-07T20:26:17.2300148Z 2025-05-07T20:26:17.2397399Z libnpp-12.3.3.65 | 130.6 MB | ######7 | 67%  2025-05-07T20:26:17.2819958Z libcublas-12.8.3.14 | 460.2 MB | #######2 | 72% 2025-05-07T20:26:17.2820242Z 2025-05-07T20:26:17.2820247Z 2025-05-07T20:26:17.2820284Z 2025-05-07T20:26:17.2820288Z 2025-05-07T20:26:17.2820292Z 2025-05-07T20:26:17.2820296Z 2025-05-07T20:26:17.2821829Z 2025-05-07T20:26:17.3053446Z cuda-nvvp-12.8.57 | 112.4 MB | ### | 30%  2025-05-07T20:26:17.3053792Z 2025-05-07T20:26:17.3053798Z 2025-05-07T20:26:17.3053803Z 2025-05-07T20:26:17.3053809Z 2025-05-07T20:26:17.3053814Z 2025-05-07T20:26:17.3053819Z 2025-05-07T20:26:17.3303114Z cuda-nsight-12.8.55 | 113.2 MB | ######5 | 65%  2025-05-07T20:26:17.3303458Z 2025-05-07T20:26:17.3303462Z 2025-05-07T20:26:17.3303466Z 2025-05-07T20:26:17.3303470Z 2025-05-07T20:26:17.3304283Z 2025-05-07T20:26:17.3442924Z libnpp-12.3.3.65 | 130.6 MB | ######9 | 69%  2025-05-07T20:26:17.3983190Z libcublas-12.8.3.14 | 460.2 MB | #######2 | 73% 2025-05-07T20:26:17.3983486Z 2025-05-07T20:26:17.3983490Z 2025-05-07T20:26:17.3983494Z 2025-05-07T20:26:17.3983498Z 2025-05-07T20:26:17.3983502Z 2025-05-07T20:26:17.3983506Z 2025-05-07T20:26:17.3983542Z 2025-05-07T20:26:17.4095846Z cuda-nvvp-12.8.57 | 112.4 MB | ###2 | 32%  2025-05-07T20:26:17.4096278Z 2025-05-07T20:26:17.4096284Z 2025-05-07T20:26:17.4096290Z 2025-05-07T20:26:17.4096296Z 2025-05-07T20:26:17.4096301Z 2025-05-07T20:26:17.4098087Z 2025-05-07T20:26:17.4331324Z cuda-nsight-12.8.55 | 113.2 MB | ######7 | 67%  2025-05-07T20:26:17.4331636Z 2025-05-07T20:26:17.4331640Z 2025-05-07T20:26:17.4331644Z 2025-05-07T20:26:17.4331647Z 2025-05-07T20:26:17.4332483Z 2025-05-07T20:26:17.4479292Z libnpp-12.3.3.65 | 130.6 MB | #######1 | 71%  2025-05-07T20:26:17.4987639Z libcublas-12.8.3.14 | 460.2 MB | #######3 | 73% 2025-05-07T20:26:17.4987989Z 2025-05-07T20:26:17.4987994Z 2025-05-07T20:26:17.4987997Z 2025-05-07T20:26:17.4988001Z 
2025-05-07T20:26:17.4988005Z 2025-05-07T20:26:17.4988009Z 2025-05-07T20:26:17.5001568Z 2025-05-07T20:26:17.5222185Z cuda-nvvp-12.8.57 | 112.4 MB | ###4 | 35%  2025-05-07T20:26:17.5222782Z 2025-05-07T20:26:17.5222787Z 2025-05-07T20:26:17.5222791Z 2025-05-07T20:26:17.5222795Z 2025-05-07T20:26:17.5222798Z 2025-05-07T20:26:17.5223830Z 2025-05-07T20:26:17.5447274Z cuda-nsight-12.8.55 | 113.2 MB | ######9 | 69%  2025-05-07T20:26:17.5447562Z 2025-05-07T20:26:17.5447896Z 2025-05-07T20:26:17.5447908Z 2025-05-07T20:26:17.5447914Z 2025-05-07T20:26:17.5451643Z 2025-05-07T20:26:17.5529607Z libnpp-12.3.3.65 | 130.6 MB | #######3 | 73%  2025-05-07T20:26:17.5989641Z libcublas-12.8.3.14 | 460.2 MB | #######3 | 74% 2025-05-07T20:26:17.5990208Z 2025-05-07T20:26:17.5990230Z 2025-05-07T20:26:17.5990235Z 2025-05-07T20:26:17.5990241Z 2025-05-07T20:26:17.5990246Z 2025-05-07T20:26:17.5990252Z 2025-05-07T20:26:17.5990258Z 2025-05-07T20:26:17.6296248Z cuda-nvvp-12.8.57 | 112.4 MB | ###6 | 37%  2025-05-07T20:26:17.6296611Z 2025-05-07T20:26:17.6296616Z 2025-05-07T20:26:17.6296656Z 2025-05-07T20:26:17.6296662Z 2025-05-07T20:26:17.6296667Z 2025-05-07T20:26:17.6300126Z 2025-05-07T20:26:17.6449186Z cuda-nsight-12.8.55 | 113.2 MB | #######1 | 72%  2025-05-07T20:26:17.6449609Z 2025-05-07T20:26:17.6449613Z 2025-05-07T20:26:17.6449617Z 2025-05-07T20:26:17.6449621Z 2025-05-07T20:26:17.6450951Z 2025-05-07T20:26:17.6534041Z libnpp-12.3.3.65 | 130.6 MB | #######5 | 76%  2025-05-07T20:26:17.7076312Z libcublas-12.8.3.14 | 460.2 MB | #######4 | 74% 2025-05-07T20:26:17.7076654Z 2025-05-07T20:26:17.7076659Z 2025-05-07T20:26:17.7076672Z 2025-05-07T20:26:17.7076707Z 2025-05-07T20:26:17.7076711Z 2025-05-07T20:26:17.7076714Z 2025-05-07T20:26:17.7076718Z 2025-05-07T20:26:17.7297543Z cuda-nvvp-12.8.57 | 112.4 MB | ###9 | 39%  2025-05-07T20:26:17.7297862Z 2025-05-07T20:26:17.7297869Z 2025-05-07T20:26:17.7297873Z 2025-05-07T20:26:17.7297877Z 2025-05-07T20:26:17.7297880Z 2025-05-07T20:26:17.7298372Z 2025-05-07T20:26:17.7453707Z cuda-nsight-12.8.55 | 113.2 MB | #######3 | 74%  2025-05-07T20:26:17.7454081Z 2025-05-07T20:26:17.7454087Z 2025-05-07T20:26:17.7454092Z 2025-05-07T20:26:17.7454098Z 2025-05-07T20:26:17.7458067Z 2025-05-07T20:26:17.7540609Z libnpp-12.3.3.65 | 130.6 MB | #######7 | 78%  2025-05-07T20:26:17.8126989Z libcublas-12.8.3.14 | 460.2 MB | #######4 | 75% 2025-05-07T20:26:17.8127281Z 2025-05-07T20:26:17.8127286Z 2025-05-07T20:26:17.8127290Z 2025-05-07T20:26:17.8127302Z 2025-05-07T20:26:17.8127306Z 2025-05-07T20:26:17.8127309Z 2025-05-07T20:26:17.8127727Z 2025-05-07T20:26:17.8349562Z cuda-nvvp-12.8.57 | 112.4 MB | ####1 | 41%  2025-05-07T20:26:17.8349921Z 2025-05-07T20:26:17.8349925Z 2025-05-07T20:26:17.8349929Z 2025-05-07T20:26:17.8349933Z 2025-05-07T20:26:17.8349937Z 2025-05-07T20:26:17.8356291Z 2025-05-07T20:26:17.8454913Z cuda-nsight-12.8.55 | 113.2 MB | #######5 | 76%  2025-05-07T20:26:17.8455403Z 2025-05-07T20:26:17.8455410Z 2025-05-07T20:26:17.8455416Z 2025-05-07T20:26:17.8455421Z 2025-05-07T20:26:17.8455427Z 2025-05-07T20:26:17.8586131Z libnpp-12.3.3.65 | 130.6 MB | ######## | 80%  2025-05-07T20:26:17.9128577Z libcublas-12.8.3.14 | 460.2 MB | #######5 | 76% 2025-05-07T20:26:17.9128866Z 2025-05-07T20:26:17.9128878Z 2025-05-07T20:26:17.9128882Z 2025-05-07T20:26:17.9128886Z 2025-05-07T20:26:17.9128890Z 2025-05-07T20:26:17.9128894Z 2025-05-07T20:26:17.9131199Z 2025-05-07T20:26:17.9389007Z cuda-nvvp-12.8.57 | 112.4 MB | ####3 | 44%  2025-05-07T20:26:17.9389320Z 2025-05-07T20:26:17.9389325Z 2025-05-07T20:26:17.9389329Z 
2025-05-07T20:26:17.9389333Z 2025-05-07T20:26:17.9389336Z 2025-05-07T20:26:17.9409879Z 2025-05-07T20:26:17.9456712Z cuda-nsight-12.8.55 | 113.2 MB | #######7 | 78%  2025-05-07T20:26:17.9457071Z 2025-05-07T20:26:17.9457075Z 2025-05-07T20:26:17.9457408Z 2025-05-07T20:26:17.9457412Z 2025-05-07T20:26:17.9457416Z 2025-05-07T20:26:17.9637767Z libnpp-12.3.3.65 | 130.6 MB | ########2 | 82%  2025-05-07T20:26:18.0131827Z libcublas-12.8.3.14 | 460.2 MB | #######6 | 76% 2025-05-07T20:26:18.0132119Z 2025-05-07T20:26:18.0132123Z 2025-05-07T20:26:18.0132135Z 2025-05-07T20:26:18.0132139Z 2025-05-07T20:26:18.0132143Z 2025-05-07T20:26:18.0132146Z 2025-05-07T20:26:18.0132150Z 2025-05-07T20:26:18.0394441Z cuda-nvvp-12.8.57 | 112.4 MB | ####5 | 46%  2025-05-07T20:26:18.0394760Z 2025-05-07T20:26:18.0394764Z 2025-05-07T20:26:18.0395038Z 2025-05-07T20:26:18.0395043Z 2025-05-07T20:26:18.0395047Z 2025-05-07T20:26:18.0395579Z 2025-05-07T20:26:18.0463205Z cuda-nsight-12.8.55 | 113.2 MB | ######## | 80%  2025-05-07T20:26:18.0463507Z 2025-05-07T20:26:18.0463511Z 2025-05-07T20:26:18.0463515Z 2025-05-07T20:26:18.0463519Z 2025-05-07T20:26:18.0465357Z 2025-05-07T20:26:18.0741935Z libnpp-12.3.3.65 | 130.6 MB | ########4 | 85%  2025-05-07T20:26:18.1265181Z libcublas-12.8.3.14 | 460.2 MB | #######6 | 77% 2025-05-07T20:26:18.1265590Z 2025-05-07T20:26:18.1265597Z 2025-05-07T20:26:18.1265602Z 2025-05-07T20:26:18.1265608Z 2025-05-07T20:26:18.1265624Z 2025-05-07T20:26:18.1265629Z 2025-05-07T20:26:18.1265635Z 2025-05-07T20:26:18.1394789Z cuda-nvvp-12.8.57 | 112.4 MB | ####7 | 48%  2025-05-07T20:26:18.1395108Z 2025-05-07T20:26:18.1395112Z 2025-05-07T20:26:18.1395123Z 2025-05-07T20:26:18.1395127Z 2025-05-07T20:26:18.1395131Z 2025-05-07T20:26:18.1397955Z 2025-05-07T20:26:18.1595134Z cuda-nsight-12.8.55 | 113.2 MB | ########2 | 82%  2025-05-07T20:26:18.1595483Z 2025-05-07T20:26:18.1595489Z 2025-05-07T20:26:18.1595494Z 2025-05-07T20:26:18.1595499Z 2025-05-07T20:26:18.1597499Z 2025-05-07T20:26:18.1745213Z libnpp-12.3.3.65 | 130.6 MB | ########6 | 87%  2025-05-07T20:26:18.2380118Z libcublas-12.8.3.14 | 460.2 MB | #######7 | 77% 2025-05-07T20:26:18.2380436Z 2025-05-07T20:26:18.2380444Z 2025-05-07T20:26:18.2380464Z 2025-05-07T20:26:18.2380470Z 2025-05-07T20:26:18.2380478Z 2025-05-07T20:26:18.2380485Z 2025-05-07T20:26:18.2380518Z 2025-05-07T20:26:18.2448107Z cuda-nvvp-12.8.57 | 112.4 MB | ##### | 50%  2025-05-07T20:26:18.2448415Z 2025-05-07T20:26:18.2448419Z 2025-05-07T20:26:18.2448951Z 2025-05-07T20:26:18.2448966Z 2025-05-07T20:26:18.2448976Z 2025-05-07T20:26:18.2453905Z 2025-05-07T20:26:18.2595205Z cuda-nsight-12.8.55 | 113.2 MB | ########4 | 84%  2025-05-07T20:26:18.2595677Z 2025-05-07T20:26:18.2595683Z 2025-05-07T20:26:18.2595688Z 2025-05-07T20:26:18.2595693Z 2025-05-07T20:26:18.2597744Z 2025-05-07T20:26:18.2746678Z libnpp-12.3.3.65 | 130.6 MB | ########9 | 89%  2025-05-07T20:26:18.3381128Z libcublas-12.8.3.14 | 460.2 MB | #######7 | 78% 2025-05-07T20:26:18.3381399Z 2025-05-07T20:26:18.3381441Z 2025-05-07T20:26:18.3381447Z 2025-05-07T20:26:18.3381451Z 2025-05-07T20:26:18.3381457Z 2025-05-07T20:26:18.3381460Z 2025-05-07T20:26:18.3381466Z 2025-05-07T20:26:18.3451363Z cuda-nvvp-12.8.57 | 112.4 MB | #####2 | 52%  2025-05-07T20:26:18.3451671Z 2025-05-07T20:26:18.3451675Z 2025-05-07T20:26:18.3451679Z 2025-05-07T20:26:18.3451682Z 2025-05-07T20:26:18.3451687Z 2025-05-07T20:26:18.3451691Z 2025-05-07T20:26:18.3669746Z cuda-nsight-12.8.55 | 113.2 MB | ########6 | 87%  2025-05-07T20:26:18.3670167Z 2025-05-07T20:26:18.3670173Z 
2025-05-07T20:26:22.7622664Z cuda-nsight-12.8.55 | 113.2 MB | ########## | 100%
2025-05-07T20:26:22.9629692Z libcufft-11.3.3.41 | 147.4 MB | ########## | 100%
2025-05-07T20:26:23.2448031Z libnpp-12.3.3.65 | 130.6 MB | ########## | 100%
2025-05-07T20:26:24.3079869Z cuda-nvvp-12.8.57 | 112.4 MB | ########## | 100%
2025-05-07T20:26:26.1600670Z libcusparse-12.5.7.5 | 164.9 MB | ########## | 100%
2025-05-07T20:26:26.2108743Z libcurand-10.3.9.55 | 43.6 MB | ########## | 100%
2025-05-07T20:26:26.5728386Z cuda-nvrtc-12.8.61 | 63.1 MB | ########## | 100%
2025-05-07T20:26:27.1027332Z gds-tools-1.13.0.11 | 37.9 MB | ########## | 100%
2025-05-07T20:26:27.6025542Z nsight-compute-2025. | 320.6 MB | ########## | 100%
2025-05-07T20:26:27.6484385Z libcusolver-11.7.2.5 | 156.9 MB | ########## | 100%
2025-05-07T20:26:28.3799738Z python-3.13.0 | 31.5 MB | ########## | 100%
2025-05-07T20:26:28.5621983Z libnvjitlink-12.8.61 | 28.7 MB | ########## | 100%
2025-05-07T20:26:28.9645039Z cuda-nvcc-tools-12.8 | 24.5 MB | ########## | 100%
2025-05-07T20:26:29.1956986Z cuda-nvvm-tools-12.8 | 23.5 MB | ########## | 100%
2025-05-07T20:26:29.5313486Z cuda-nvcc-dev_linux- | 12.7 MB | ########## | 100%
2025-05-07T20:26:29.5609664Z cuda-sanitizer-api-1 | 8.8 MB | ########## | 100%
2025-05-07T20:26:29.5653329Z cuda-nvdisasm-12.8.5 | 4.9 MB | ########## | 100%
2025-05-07T20:26:29.6654732Z ... (more hidden) ...
2025-05-07T20:26:29.8284837Z cuda-nvvm-impl-12.8. | 20.8 MB | ########## | 100%
2025-05-07T20:26:32.1996756Z libcublas-12.8.3.14 | 460.2 MB | ########## | 100%
2025-05-07T20:26:41.1160764Z 2025-05-07T20:26:41.1160770Z 2025-05-07T20:26:41.1160779Z 2025-05-07T20:26:41.1160784Z 2025-05-07T20:26:41.1160790Z 2025-05-07T20:26:41.1160795Z 2025-05-07T20:26:41.1161164Z  2025-05-07T20:26:41.1161366Z 2025-05-07T20:26:41.1161379Z 2025-05-07T20:26:41.1161385Z 2025-05-07T20:26:41.1161389Z 2025-05-07T20:26:41.1161393Z 2025-05-07T20:26:41.1161396Z 2025-05-07T20:26:41.1161400Z 2025-05-07T20:26:41.1161404Z 2025-05-07T20:26:41.1161872Z  2025-05-07T20:26:41.1162088Z 2025-05-07T20:26:41.1162094Z 2025-05-07T20:26:41.1162107Z 2025-05-07T20:26:41.1162113Z 2025-05-07T20:26:41.1162129Z 2025-05-07T20:26:41.1162135Z 2025-05-07T20:26:41.1162140Z 2025-05-07T20:26:41.1162146Z 2025-05-07T20:26:41.1162160Z 2025-05-07T20:26:41.1162537Z  2025-05-07T20:26:41.1162756Z 2025-05-07T20:26:41.1162761Z 2025-05-07T20:26:41.1162764Z 2025-05-07T20:26:41.1162774Z 2025-05-07T20:26:41.1162778Z 2025-05-07T20:26:41.1162782Z 2025-05-07T20:26:41.1162902Z 2025-05-07T20:26:41.1162906Z 2025-05-07T20:26:41.1162909Z 2025-05-07T20:26:41.1162913Z 2025-05-07T20:26:41.1163172Z  2025-05-07T20:26:41.1163397Z 2025-05-07T20:26:41.1163407Z 2025-05-07T20:26:41.1163410Z 2025-05-07T20:26:41.1163414Z 2025-05-07T20:26:41.1163418Z 2025-05-07T20:26:41.1163421Z 2025-05-07T20:26:41.1163425Z 2025-05-07T20:26:41.1163429Z 2025-05-07T20:26:41.1163432Z 2025-05-07T20:26:41.1163436Z 2025-05-07T20:26:41.1163440Z 2025-05-07T20:26:41.1163846Z  2025-05-07T20:26:41.1164088Z 2025-05-07T20:26:41.1164094Z 2025-05-07T20:26:41.1164237Z 2025-05-07T20:26:41.1164244Z 2025-05-07T20:26:41.1164266Z 2025-05-07T20:26:41.1164272Z 2025-05-07T20:26:41.1164277Z 2025-05-07T20:26:41.1164282Z 2025-05-07T20:26:41.1164288Z 2025-05-07T20:26:41.1164293Z 2025-05-07T20:26:41.1164298Z 2025-05-07T20:26:41.1164303Z 2025-05-07T20:26:41.1164521Z  2025-05-07T20:26:41.1164774Z 2025-05-07T20:26:41.1164785Z 2025-05-07T20:26:41.1164791Z 2025-05-07T20:26:41.1164796Z 2025-05-07T20:26:41.1164801Z 2025-05-07T20:26:41.1164806Z 2025-05-07T20:26:41.1164811Z 2025-05-07T20:26:41.1164817Z 2025-05-07T20:26:41.1164822Z 2025-05-07T20:26:41.1164827Z 2025-05-07T20:26:41.1164832Z 2025-05-07T20:26:41.1164837Z 2025-05-07T20:26:41.1164842Z 2025-05-07T20:26:41.1165217Z  2025-05-07T20:26:41.1165471Z 2025-05-07T20:26:41.1165484Z 2025-05-07T20:26:41.1165489Z 2025-05-07T20:26:41.1165494Z 2025-05-07T20:26:41.1165499Z 2025-05-07T20:26:41.1165504Z 2025-05-07T20:26:41.1165510Z 2025-05-07T20:26:41.1165523Z 2025-05-07T20:26:41.1165528Z 2025-05-07T20:26:41.1165533Z 2025-05-07T20:26:41.1165539Z 2025-05-07T20:26:41.1165544Z 2025-05-07T20:26:41.1165558Z 2025-05-07T20:26:41.1165563Z 2025-05-07T20:26:41.1165833Z  2025-05-07T20:26:41.1166028Z 2025-05-07T20:26:41.1166032Z 2025-05-07T20:26:41.1166036Z 2025-05-07T20:26:41.1166057Z 2025-05-07T20:26:41.1166061Z 2025-05-07T20:26:41.1166064Z 2025-05-07T20:26:41.1166068Z 2025-05-07T20:26:41.1166072Z 2025-05-07T20:26:41.1166086Z 2025-05-07T20:26:41.1166090Z 2025-05-07T20:26:41.1166094Z 2025-05-07T20:26:41.1166097Z 2025-05-07T20:26:41.1166101Z 2025-05-07T20:26:41.1166105Z 2025-05-07T20:26:41.1166108Z 2025-05-07T20:26:41.1166480Z  2025-05-07T20:26:41.1166747Z 2025-05-07T20:26:41.1166759Z 2025-05-07T20:26:41.1166764Z 2025-05-07T20:26:41.1166776Z 2025-05-07T20:26:41.1166782Z 2025-05-07T20:26:41.1166787Z 2025-05-07T20:26:41.1166793Z 2025-05-07T20:26:41.1166804Z 2025-05-07T20:26:41.1166810Z 2025-05-07T20:26:41.1166815Z 2025-05-07T20:26:41.1166821Z 2025-05-07T20:26:41.1166826Z 2025-05-07T20:26:41.1166831Z 2025-05-07T20:26:41.1166837Z 2025-05-07T20:26:41.1166842Z 
2025-05-07T20:26:41.1166847Z 2025-05-07T20:26:41.1167095Z  2025-05-07T20:26:41.1167338Z 2025-05-07T20:26:41.1167354Z 2025-05-07T20:26:41.1167358Z 2025-05-07T20:26:41.1167362Z 2025-05-07T20:26:41.1167365Z 2025-05-07T20:26:41.1167369Z 2025-05-07T20:26:41.1167373Z 2025-05-07T20:26:41.1167376Z 2025-05-07T20:26:41.1167380Z 2025-05-07T20:26:41.1167384Z 2025-05-07T20:26:41.1167387Z 2025-05-07T20:26:41.1167391Z 2025-05-07T20:26:41.1167394Z 2025-05-07T20:26:41.1167398Z 2025-05-07T20:26:41.1167402Z 2025-05-07T20:26:41.1167405Z 2025-05-07T20:26:41.1167409Z 2025-05-07T20:26:41.1167829Z  2025-05-07T20:26:41.1168124Z 2025-05-07T20:26:41.1168129Z 2025-05-07T20:26:41.1168135Z 2025-05-07T20:26:41.1168157Z 2025-05-07T20:26:41.1168163Z 2025-05-07T20:26:41.1168169Z 2025-05-07T20:26:41.1168174Z 2025-05-07T20:26:41.1168180Z 2025-05-07T20:26:41.1168185Z 2025-05-07T20:26:41.1168191Z 2025-05-07T20:26:41.1168196Z 2025-05-07T20:26:41.1168202Z 2025-05-07T20:26:41.1168207Z 2025-05-07T20:26:41.1168212Z 2025-05-07T20:26:41.1168217Z 2025-05-07T20:26:41.1168756Z 2025-05-07T20:26:41.1168760Z 2025-05-07T20:26:41.1168764Z 2025-05-07T20:26:41.1169050Z  2025-05-07T20:26:41.1169348Z 2025-05-07T20:26:41.1169354Z 2025-05-07T20:26:41.1169630Z  2025-05-07T20:26:41.1169781Z 2025-05-07T20:26:41.1169790Z 2025-05-07T20:26:41.1170245Z  2025-05-07T20:26:41.1170412Z 2025-05-07T20:26:41.1170420Z 2025-05-07T20:26:41.1170431Z 2025-05-07T20:26:41.1170946Z  2025-05-07T20:26:41.1171102Z 2025-05-07T20:26:41.1171107Z 2025-05-07T20:26:41.1171116Z 2025-05-07T20:26:41.1171122Z 2025-05-07T20:26:41.1171590Z  2025-05-07T20:26:41.1171887Z 2025-05-07T20:26:41.1171902Z 2025-05-07T20:26:41.1171906Z 2025-05-07T20:26:41.1171910Z 2025-05-07T20:26:41.1171913Z 2025-05-07T20:26:41.1172291Z  2025-05-07T20:26:41.1172465Z 2025-05-07T20:26:41.1172471Z 2025-05-07T20:26:41.1172477Z 2025-05-07T20:26:41.1172486Z 2025-05-07T20:26:41.1172491Z 2025-05-07T20:26:41.1172496Z 2025-05-07T20:26:41.1172942Z  2025-05-07T20:26:41.1173126Z 2025-05-07T20:26:41.1173132Z 2025-05-07T20:26:41.1173144Z 2025-05-07T20:26:41.1173149Z 2025-05-07T20:26:41.1173155Z 2025-05-07T20:26:41.1173160Z 2025-05-07T20:26:41.1173166Z 2025-05-07T20:26:41.1173597Z  2025-05-07T20:26:41.1173790Z 2025-05-07T20:26:41.1173800Z 2025-05-07T20:26:41.1173806Z 2025-05-07T20:26:41.1173811Z 2025-05-07T20:26:41.1173816Z 2025-05-07T20:26:41.1173821Z 2025-05-07T20:26:41.1173826Z 2025-05-07T20:26:41.1173832Z 2025-05-07T20:26:41.1174249Z  2025-05-07T20:26:41.1174456Z 2025-05-07T20:26:41.1174466Z 2025-05-07T20:26:41.1174481Z 2025-05-07T20:26:41.1174487Z 2025-05-07T20:26:41.1174492Z 2025-05-07T20:26:41.1174497Z 2025-05-07T20:26:41.1174502Z 2025-05-07T20:26:41.1174508Z 2025-05-07T20:26:41.1174513Z 2025-05-07T20:26:41.1174952Z  2025-05-07T20:26:41.1175170Z 2025-05-07T20:26:41.1175174Z 2025-05-07T20:26:41.1175178Z 2025-05-07T20:26:41.1175194Z 2025-05-07T20:26:41.1175198Z 2025-05-07T20:26:41.1175201Z 2025-05-07T20:26:41.1175205Z 2025-05-07T20:26:41.1175209Z 2025-05-07T20:26:41.1175212Z 2025-05-07T20:26:41.1175216Z 2025-05-07T20:26:41.1175520Z  2025-05-07T20:26:41.1175725Z 2025-05-07T20:26:41.1175734Z 2025-05-07T20:26:41.1175738Z 2025-05-07T20:26:41.1175742Z 2025-05-07T20:26:41.1175746Z 2025-05-07T20:26:41.1175749Z 2025-05-07T20:26:41.1175753Z 2025-05-07T20:26:41.1175757Z 2025-05-07T20:26:41.1175760Z 2025-05-07T20:26:41.1175771Z 2025-05-07T20:26:41.1175774Z 2025-05-07T20:26:41.1176247Z  2025-05-07T20:26:41.1176499Z 2025-05-07T20:26:41.1176505Z 2025-05-07T20:26:41.1176510Z 2025-05-07T20:26:41.1176516Z 2025-05-07T20:26:41.1176521Z 
2025-05-07T20:26:41.1176532Z 2025-05-07T20:26:41.1176537Z 2025-05-07T20:26:41.1176542Z 2025-05-07T20:26:41.1176547Z 2025-05-07T20:26:41.1176561Z 2025-05-07T20:26:41.1176567Z 2025-05-07T20:26:41.1176572Z 2025-05-07T20:26:41.1176792Z  2025-05-07T20:26:41.1177013Z 2025-05-07T20:26:41.1177023Z 2025-05-07T20:26:41.1177027Z 2025-05-07T20:26:41.1177037Z 2025-05-07T20:26:41.1177041Z 2025-05-07T20:26:41.1177045Z 2025-05-07T20:26:41.1177049Z 2025-05-07T20:26:41.1177052Z 2025-05-07T20:26:41.1177056Z 2025-05-07T20:26:41.1177060Z 2025-05-07T20:26:41.1177063Z 2025-05-07T20:26:41.1177067Z 2025-05-07T20:26:41.1177071Z 2025-05-07T20:26:41.1177492Z  2025-05-07T20:26:41.1177760Z 2025-05-07T20:26:41.1177766Z 2025-05-07T20:26:41.1177772Z 2025-05-07T20:26:41.1177785Z 2025-05-07T20:26:41.1177791Z 2025-05-07T20:26:41.1177803Z 2025-05-07T20:26:41.1177809Z 2025-05-07T20:26:41.1177815Z 2025-05-07T20:26:41.1177820Z 2025-05-07T20:26:41.1177826Z 2025-05-07T20:26:41.1177831Z 2025-05-07T20:26:41.1177837Z 2025-05-07T20:26:41.1177842Z 2025-05-07T20:26:41.1177847Z 2025-05-07T20:26:41.1178138Z  2025-05-07T20:26:41.1178402Z 2025-05-07T20:26:41.1178533Z 2025-05-07T20:26:41.1178538Z 2025-05-07T20:26:41.1178561Z 2025-05-07T20:26:41.1178567Z 2025-05-07T20:26:41.1178572Z 2025-05-07T20:26:41.1178577Z 2025-05-07T20:26:41.1178583Z 2025-05-07T20:26:41.1178588Z 2025-05-07T20:26:41.1178594Z 2025-05-07T20:26:41.1178599Z 2025-05-07T20:26:41.1178604Z 2025-05-07T20:26:41.1178609Z 2025-05-07T20:26:41.1178614Z 2025-05-07T20:26:41.1178619Z 2025-05-07T20:26:41.1178824Z  2025-05-07T20:26:41.1179106Z 2025-05-07T20:26:41.1179111Z 2025-05-07T20:26:41.1179116Z 2025-05-07T20:26:41.1179121Z 2025-05-07T20:26:41.1179126Z 2025-05-07T20:26:41.1179218Z 2025-05-07T20:26:41.1179224Z 2025-05-07T20:26:41.1179230Z 2025-05-07T20:26:41.1179234Z 2025-05-07T20:26:41.1179239Z 2025-05-07T20:26:41.1179244Z 2025-05-07T20:26:41.1179250Z 2025-05-07T20:26:41.1179255Z 2025-05-07T20:26:41.1179260Z 2025-05-07T20:26:41.1179266Z 2025-05-07T20:26:41.1179271Z 2025-05-07T20:26:41.1179505Z  2025-05-07T20:26:41.1179785Z 2025-05-07T20:26:41.1179790Z 2025-05-07T20:26:41.1179795Z 2025-05-07T20:26:41.1179801Z 2025-05-07T20:26:41.1179806Z 2025-05-07T20:26:41.1179811Z 2025-05-07T20:26:41.1179826Z 2025-05-07T20:26:41.1179832Z 2025-05-07T20:26:41.1179837Z 2025-05-07T20:26:41.1179842Z 2025-05-07T20:26:41.1179847Z 2025-05-07T20:26:41.1179853Z 2025-05-07T20:26:41.1179858Z 2025-05-07T20:26:41.1179864Z 2025-05-07T20:26:41.1179870Z 2025-05-07T20:26:41.1179875Z 2025-05-07T20:26:41.1179890Z 2025-05-07T20:26:41.1180102Z  2025-05-07T20:26:41.1180393Z 2025-05-07T20:26:41.1180405Z 2025-05-07T20:26:41.1180411Z 2025-05-07T20:26:41.1180416Z 2025-05-07T20:26:41.1180421Z 2025-05-07T20:26:41.1180426Z 2025-05-07T20:26:41.1180431Z 2025-05-07T20:26:41.1180436Z 2025-05-07T20:26:41.1180442Z 2025-05-07T20:26:41.1180447Z 2025-05-07T20:26:41.1180452Z 2025-05-07T20:26:41.1180457Z 2025-05-07T20:26:41.1180462Z 2025-05-07T20:26:41.1180473Z 2025-05-07T20:26:41.1180478Z 2025-05-07T20:26:41.1180483Z 2025-05-07T20:26:41.1180489Z 2025-05-07T20:26:41.1180506Z 2025-05-07T20:26:41.1181436Z  2025-05-07T20:26:41.1181738Z 2025-05-07T20:26:41.1181744Z 2025-05-07T20:26:41.1181889Z  2025-05-07T20:26:41.1182022Z 2025-05-07T20:26:41.1182027Z 2025-05-07T20:26:41.1182405Z  2025-05-07T20:26:41.1182545Z 2025-05-07T20:26:41.1182554Z 2025-05-07T20:26:41.1182559Z 2025-05-07T20:26:41.1183079Z  2025-05-07T20:26:41.1183233Z 2025-05-07T20:26:41.1183238Z 2025-05-07T20:26:41.1183244Z 2025-05-07T20:26:41.1183249Z 2025-05-07T20:26:41.1183598Z  
2025-05-07T20:26:41.1183775Z 2025-05-07T20:26:41.1183780Z 2025-05-07T20:26:41.1183785Z 2025-05-07T20:26:41.1183794Z 2025-05-07T20:26:41.1183799Z 2025-05-07T20:26:41.1184151Z  2025-05-07T20:26:41.1184329Z 2025-05-07T20:26:41.1184342Z 2025-05-07T20:26:41.1184347Z 2025-05-07T20:26:41.1184352Z 2025-05-07T20:26:41.1184364Z 2025-05-07T20:26:41.1184369Z 2025-05-07T20:26:41.1184881Z  2025-05-07T20:26:41.1185029Z 2025-05-07T20:26:41.1185034Z 2025-05-07T20:26:41.1185038Z 2025-05-07T20:26:41.1185041Z 2025-05-07T20:26:41.1185045Z 2025-05-07T20:26:41.1185049Z 2025-05-07T20:26:41.1185053Z 2025-05-07T20:26:41.1185336Z  2025-05-07T20:26:41.1185530Z 2025-05-07T20:26:41.1185536Z 2025-05-07T20:26:41.1185542Z 2025-05-07T20:26:41.1185547Z 2025-05-07T20:26:41.1185556Z 2025-05-07T20:26:41.1185561Z 2025-05-07T20:26:41.1185566Z 2025-05-07T20:26:41.1185572Z 2025-05-07T20:26:41.1185929Z  2025-05-07T20:26:41.1186147Z 2025-05-07T20:26:41.1186153Z 2025-05-07T20:26:41.1186158Z 2025-05-07T20:26:41.1186164Z 2025-05-07T20:26:41.1186169Z 2025-05-07T20:26:41.1186178Z 2025-05-07T20:26:41.1186183Z 2025-05-07T20:26:41.1186188Z 2025-05-07T20:26:41.1186193Z 2025-05-07T20:26:41.1186498Z  2025-05-07T20:26:41.1186711Z 2025-05-07T20:26:41.1186846Z 2025-05-07T20:26:41.1186852Z 2025-05-07T20:26:41.1186857Z 2025-05-07T20:26:41.1186862Z 2025-05-07T20:26:41.1186868Z 2025-05-07T20:26:41.1186873Z 2025-05-07T20:26:41.1186878Z 2025-05-07T20:26:41.1186884Z 2025-05-07T20:26:41.1186889Z 2025-05-07T20:26:41.1187083Z  2025-05-07T20:26:41.1187298Z 2025-05-07T20:26:41.1187304Z 2025-05-07T20:26:41.1187309Z 2025-05-07T20:26:41.1187314Z 2025-05-07T20:26:41.1187319Z 2025-05-07T20:26:41.1187324Z 2025-05-07T20:26:41.1187330Z 2025-05-07T20:26:41.1187338Z 2025-05-07T20:26:41.1187343Z 2025-05-07T20:26:41.1187361Z 2025-05-07T20:26:41.1187366Z 2025-05-07T20:26:41.1187868Z  2025-05-07T20:26:41.1188087Z 2025-05-07T20:26:41.1188092Z 2025-05-07T20:26:41.1188095Z 2025-05-07T20:26:41.1188103Z 2025-05-07T20:26:41.1188203Z 2025-05-07T20:26:41.1188209Z 2025-05-07T20:26:41.1188212Z 2025-05-07T20:26:41.1188216Z 2025-05-07T20:26:41.1188220Z 2025-05-07T20:26:41.1188224Z 2025-05-07T20:26:41.1188239Z 2025-05-07T20:26:41.1188384Z 2025-05-07T20:26:41.1188645Z  2025-05-07T20:26:41.1188900Z 2025-05-07T20:26:41.1188906Z 2025-05-07T20:26:41.1188911Z 2025-05-07T20:26:41.1188917Z 2025-05-07T20:26:41.1188931Z 2025-05-07T20:26:41.1188937Z 2025-05-07T20:26:41.1188950Z 2025-05-07T20:26:41.1188955Z 2025-05-07T20:26:41.1188960Z 2025-05-07T20:26:41.1188966Z 2025-05-07T20:26:41.1188971Z 2025-05-07T20:26:41.1188976Z 2025-05-07T20:26:41.1188982Z 2025-05-07T20:26:41.1189176Z  2025-05-07T20:26:41.1189431Z 2025-05-07T20:26:41.1189448Z 2025-05-07T20:26:41.1189464Z 2025-05-07T20:26:41.1189469Z 2025-05-07T20:26:41.1189475Z 2025-05-07T20:26:41.1189480Z 2025-05-07T20:26:41.1189485Z 2025-05-07T20:26:41.1189490Z 2025-05-07T20:26:41.1189495Z 2025-05-07T20:26:41.1189501Z 2025-05-07T20:26:41.1189506Z 2025-05-07T20:26:41.1189511Z 2025-05-07T20:26:41.1189516Z 2025-05-07T20:26:41.1189522Z 2025-05-07T20:26:41.1189719Z  2025-05-07T20:26:41.1189997Z 2025-05-07T20:26:41.1190003Z 2025-05-07T20:26:41.1190008Z 2025-05-07T20:26:41.1190013Z 2025-05-07T20:26:41.1190019Z 2025-05-07T20:26:41.1190024Z 2025-05-07T20:26:41.1190029Z 2025-05-07T20:26:41.1190035Z 2025-05-07T20:26:41.1190040Z 2025-05-07T20:26:41.1190045Z 2025-05-07T20:26:41.1190050Z 2025-05-07T20:26:41.1190054Z 2025-05-07T20:26:41.1190059Z 2025-05-07T20:26:41.1190064Z 2025-05-07T20:26:41.1190070Z 2025-05-07T20:26:41.1190293Z  2025-05-07T20:26:41.1190556Z 
2025-05-07T20:26:41.1190561Z 2025-05-07T20:26:41.1190565Z 2025-05-07T20:26:41.1190577Z 2025-05-07T20:26:41.1190582Z 2025-05-07T20:26:41.1190586Z 2025-05-07T20:26:41.1190598Z 2025-05-07T20:26:41.1190603Z 2025-05-07T20:26:41.1190607Z 2025-05-07T20:26:41.1190612Z 2025-05-07T20:26:41.1190617Z 2025-05-07T20:26:41.1190622Z 2025-05-07T20:26:41.1190626Z 2025-05-07T20:26:41.1190631Z 2025-05-07T20:26:41.1190636Z 2025-05-07T20:26:41.1190648Z 2025-05-07T20:26:41.1190880Z  2025-05-07T20:26:41.1191157Z 2025-05-07T20:26:41.1191163Z 2025-05-07T20:26:41.1191168Z 2025-05-07T20:26:41.1191173Z 2025-05-07T20:26:41.1191179Z 2025-05-07T20:26:41.1191193Z 2025-05-07T20:26:41.1191199Z 2025-05-07T20:26:41.1191204Z 2025-05-07T20:26:41.1191209Z 2025-05-07T20:26:41.1191214Z 2025-05-07T20:26:41.1191220Z 2025-05-07T20:26:41.1191225Z 2025-05-07T20:26:41.1191230Z 2025-05-07T20:26:41.1191235Z 2025-05-07T20:26:41.1191240Z 2025-05-07T20:26:41.1191246Z 2025-05-07T20:26:41.1191251Z 2025-05-07T20:26:41.1191475Z  2025-05-07T20:26:41.1191759Z 2025-05-07T20:26:41.1191765Z 2025-05-07T20:26:41.1191770Z 2025-05-07T20:26:41.1191775Z 2025-05-07T20:26:41.1191781Z 2025-05-07T20:26:41.1191786Z 2025-05-07T20:26:41.1191792Z 2025-05-07T20:26:41.1191797Z 2025-05-07T20:26:41.1191813Z 2025-05-07T20:26:41.1191818Z 2025-05-07T20:26:41.1191823Z 2025-05-07T20:26:41.1191992Z 2025-05-07T20:26:41.1191997Z 2025-05-07T20:26:41.1192002Z 2025-05-07T20:26:41.1192008Z 2025-05-07T20:26:41.1192013Z 2025-05-07T20:26:41.1192018Z 2025-05-07T20:26:41.1192023Z 2025-05-07T20:26:41.1192265Z  2025-05-07T20:26:41.1192558Z 2025-05-07T20:26:41.1192563Z 2025-05-07T20:26:41.1192699Z  2025-05-07T20:26:41.1192832Z 2025-05-07T20:26:41.1192838Z 2025-05-07T20:26:41.1192986Z  2025-05-07T20:26:41.1193123Z 2025-05-07T20:26:41.1193129Z 2025-05-07T20:26:41.1193134Z 2025-05-07T20:26:41.1193283Z  2025-05-07T20:26:41.1193442Z 2025-05-07T20:26:41.1193542Z 2025-05-07T20:26:41.1193548Z 2025-05-07T20:26:41.1193554Z 2025-05-07T20:26:41.1193706Z  2025-05-07T20:26:41.1193870Z 2025-05-07T20:26:41.1193875Z 2025-05-07T20:26:41.1193881Z 2025-05-07T20:26:41.1193886Z 2025-05-07T20:26:41.1193891Z 2025-05-07T20:26:41.1194041Z  2025-05-07T20:26:41.1194216Z 2025-05-07T20:26:41.1194222Z 2025-05-07T20:26:41.1194235Z 2025-05-07T20:26:41.1194240Z 2025-05-07T20:26:41.1194246Z 2025-05-07T20:26:41.1194251Z 2025-05-07T20:26:41.1194412Z  2025-05-07T20:26:41.1194595Z 2025-05-07T20:26:41.1194600Z 2025-05-07T20:26:41.1194606Z 2025-05-07T20:26:41.1194611Z 2025-05-07T20:26:41.1194616Z 2025-05-07T20:26:41.1194621Z 2025-05-07T20:26:41.1194627Z 2025-05-07T20:26:41.1194796Z  2025-05-07T20:26:41.1194991Z 2025-05-07T20:26:41.1194996Z 2025-05-07T20:26:41.1195001Z 2025-05-07T20:26:41.1195007Z 2025-05-07T20:26:41.1195012Z 2025-05-07T20:26:41.1195017Z 2025-05-07T20:26:41.1195022Z 2025-05-07T20:26:41.1195036Z 2025-05-07T20:26:41.1195207Z  2025-05-07T20:26:41.1195414Z 2025-05-07T20:26:41.1195419Z 2025-05-07T20:26:41.1195425Z 2025-05-07T20:26:41.1195430Z 2025-05-07T20:26:41.1195435Z 2025-05-07T20:26:41.1195440Z 2025-05-07T20:26:41.1195445Z 2025-05-07T20:26:41.1195450Z 2025-05-07T20:26:41.1195456Z 2025-05-07T20:26:41.1195628Z  2025-05-07T20:26:41.1195850Z 2025-05-07T20:26:41.1195856Z 2025-05-07T20:26:41.1195861Z 2025-05-07T20:26:41.1195866Z 2025-05-07T20:26:41.1195871Z 2025-05-07T20:26:41.1195876Z 2025-05-07T20:26:41.1195881Z 2025-05-07T20:26:41.1195887Z 2025-05-07T20:26:41.1195892Z 2025-05-07T20:26:41.1195897Z 2025-05-07T20:26:41.1196088Z  2025-05-07T20:26:41.1196305Z 2025-05-07T20:26:41.1196311Z 2025-05-07T20:26:41.1196316Z 
2025-05-07T20:26:41.1196321Z 2025-05-07T20:26:41.1196326Z 2025-05-07T20:26:41.1196331Z 2025-05-07T20:26:41.1196337Z 2025-05-07T20:26:41.1196342Z 2025-05-07T20:26:41.1196347Z 2025-05-07T20:26:41.1196359Z 2025-05-07T20:26:41.1196364Z 2025-05-07T20:26:41.1196551Z  2025-05-07T20:26:41.1196783Z 2025-05-07T20:26:41.1196788Z 2025-05-07T20:26:41.1196793Z 2025-05-07T20:26:41.1196798Z 2025-05-07T20:26:41.1196803Z 2025-05-07T20:26:41.1196808Z 2025-05-07T20:26:41.1196814Z 2025-05-07T20:26:41.1196819Z 2025-05-07T20:26:41.1196837Z 2025-05-07T20:26:41.1196842Z 2025-05-07T20:26:41.1196847Z 2025-05-07T20:26:41.1196852Z 2025-05-07T20:26:41.1197034Z  2025-05-07T20:26:41.1197277Z 2025-05-07T20:26:41.1197282Z 2025-05-07T20:26:41.1197295Z 2025-05-07T20:26:41.1197300Z 2025-05-07T20:26:41.1197305Z 2025-05-07T20:26:41.1197310Z 2025-05-07T20:26:41.1197315Z 2025-05-07T20:26:41.1197319Z 2025-05-07T20:26:41.1197324Z 2025-05-07T20:26:41.1197329Z 2025-05-07T20:26:41.1197333Z 2025-05-07T20:26:41.1197338Z 2025-05-07T20:26:41.1197343Z 2025-05-07T20:26:41.1197525Z  2025-05-07T20:26:41.1197799Z 2025-05-07T20:26:41.1197806Z 2025-05-07T20:26:41.1197811Z 2025-05-07T20:26:41.1197816Z 2025-05-07T20:26:41.1197821Z 2025-05-07T20:26:41.1197826Z 2025-05-07T20:26:41.1197831Z 2025-05-07T20:26:41.1197836Z 2025-05-07T20:26:41.1197841Z 2025-05-07T20:26:41.1197846Z 2025-05-07T20:26:41.1197851Z 2025-05-07T20:26:41.1197856Z 2025-05-07T20:26:41.1197987Z 2025-05-07T20:26:41.1197992Z 2025-05-07T20:26:41.1198217Z  2025-05-07T20:26:41.1198411Z 2025-05-07T20:26:41.1198415Z 2025-05-07T20:26:41.1198418Z 2025-05-07T20:26:41.1198422Z 2025-05-07T20:26:41.1198426Z 2025-05-07T20:26:41.1198429Z 2025-05-07T20:26:41.1198433Z 2025-05-07T20:26:41.1198437Z 2025-05-07T20:26:41.1198440Z 2025-05-07T20:26:41.1198444Z 2025-05-07T20:26:41.1198456Z 2025-05-07T20:26:41.1198460Z 2025-05-07T20:26:41.1198464Z 2025-05-07T20:26:41.1198467Z 2025-05-07T20:26:41.1198471Z 2025-05-07T20:26:41.1198623Z  2025-05-07T20:26:41.1198907Z 2025-05-07T20:26:41.1198912Z 2025-05-07T20:26:41.1198915Z 2025-05-07T20:26:41.1198919Z 2025-05-07T20:26:41.1198923Z 2025-05-07T20:26:41.1198926Z 2025-05-07T20:26:41.1198930Z 2025-05-07T20:26:41.1198934Z 2025-05-07T20:26:41.1198937Z 2025-05-07T20:26:41.1198941Z 2025-05-07T20:26:41.1198944Z 2025-05-07T20:26:41.1198948Z 2025-05-07T20:26:41.1198952Z 2025-05-07T20:26:41.1198963Z 2025-05-07T20:26:41.1198967Z 2025-05-07T20:26:41.1198970Z 2025-05-07T20:26:41.1199157Z  2025-05-07T20:26:41.1199432Z 2025-05-07T20:26:41.1199455Z 2025-05-07T20:26:41.1199461Z 2025-05-07T20:26:41.1199467Z 2025-05-07T20:26:41.1199472Z 2025-05-07T20:26:41.1199477Z 2025-05-07T20:26:41.1199483Z 2025-05-07T20:26:41.1199488Z 2025-05-07T20:26:41.1199493Z 2025-05-07T20:26:41.1199498Z 2025-05-07T20:26:41.1199511Z 2025-05-07T20:26:41.1199517Z 2025-05-07T20:26:41.1199522Z 2025-05-07T20:26:41.1199527Z 2025-05-07T20:26:41.1199532Z 2025-05-07T20:26:41.1199537Z 2025-05-07T20:26:41.1199550Z 2025-05-07T20:26:41.1199802Z  2025-05-07T20:26:41.1200090Z 2025-05-07T20:26:41.1200100Z 2025-05-07T20:26:41.1200104Z 2025-05-07T20:26:41.1200108Z 2025-05-07T20:26:41.1200112Z 2025-05-07T20:26:41.1200115Z 2025-05-07T20:26:41.1200119Z 2025-05-07T20:26:41.1200123Z 2025-05-07T20:26:41.1200135Z 2025-05-07T20:26:41.1200138Z 2025-05-07T20:26:41.1200142Z 2025-05-07T20:26:41.1200145Z 2025-05-07T20:26:41.1200149Z 2025-05-07T20:26:41.1200153Z 2025-05-07T20:26:41.1200156Z 2025-05-07T20:26:41.1200160Z 2025-05-07T20:26:41.1200164Z 2025-05-07T20:26:41.1200170Z 2025-05-07T20:26:41.1201480Z  2025-05-07T20:26:41.1201804Z 
2025-05-07T20:26:41.1202429Z 2025-05-07T20:26:41.1202637Z  2025-05-07T20:26:41.1202799Z 2025-05-07T20:26:41.1202819Z 2025-05-07T20:26:41.1202952Z  2025-05-07T20:26:41.1203064Z 2025-05-07T20:26:41.1203068Z 2025-05-07T20:26:41.1203072Z 2025-05-07T20:26:41.1203186Z  2025-05-07T20:26:41.1203335Z 2025-05-07T20:26:41.1203349Z 2025-05-07T20:26:41.1203354Z 2025-05-07T20:26:41.1203360Z 2025-05-07T20:26:41.1203494Z  2025-05-07T20:26:41.1203611Z 2025-05-07T20:26:41.1203615Z 2025-05-07T20:26:41.1203618Z 2025-05-07T20:26:41.1203622Z 2025-05-07T20:26:41.1203626Z 2025-05-07T20:26:41.1203740Z  2025-05-07T20:26:41.1203867Z 2025-05-07T20:26:41.1203871Z 2025-05-07T20:26:41.1203875Z 2025-05-07T20:26:41.1203878Z 2025-05-07T20:26:41.1203882Z 2025-05-07T20:26:41.1203886Z 2025-05-07T20:26:41.1204002Z  2025-05-07T20:26:41.1204130Z 2025-05-07T20:26:41.1204136Z 2025-05-07T20:26:41.1204141Z 2025-05-07T20:26:41.1204146Z 2025-05-07T20:26:41.1204152Z 2025-05-07T20:26:41.1204157Z 2025-05-07T20:26:41.1204163Z 2025-05-07T20:26:41.1204345Z  2025-05-07T20:26:41.1204483Z 2025-05-07T20:26:41.1204487Z 2025-05-07T20:26:41.1204491Z 2025-05-07T20:26:41.1204495Z 2025-05-07T20:26:41.1204498Z 2025-05-07T20:26:41.1204507Z 2025-05-07T20:26:41.1204511Z 2025-05-07T20:26:41.1204515Z 2025-05-07T20:26:41.1204668Z  2025-05-07T20:26:41.1204879Z 2025-05-07T20:26:41.1204884Z 2025-05-07T20:26:41.1204890Z 2025-05-07T20:26:41.1204895Z 2025-05-07T20:26:41.1204900Z 2025-05-07T20:26:41.1204905Z 2025-05-07T20:26:41.1204911Z 2025-05-07T20:26:41.1205082Z 2025-05-07T20:26:41.1205088Z 2025-05-07T20:26:41.1205269Z  2025-05-07T20:26:41.1205423Z 2025-05-07T20:26:41.1205427Z 2025-05-07T20:26:41.1205431Z 2025-05-07T20:26:41.1205434Z 2025-05-07T20:26:41.1205438Z 2025-05-07T20:26:41.1205442Z 2025-05-07T20:26:41.1205451Z 2025-05-07T20:26:41.1205455Z 2025-05-07T20:26:41.1205459Z 2025-05-07T20:26:41.1205473Z 2025-05-07T20:26:41.1205601Z  2025-05-07T20:26:41.1205756Z 2025-05-07T20:26:41.1205766Z 2025-05-07T20:26:41.1205770Z 2025-05-07T20:26:41.1205773Z 2025-05-07T20:26:41.1205777Z 2025-05-07T20:26:41.1205781Z 2025-05-07T20:26:41.1205870Z 2025-05-07T20:26:41.1205874Z 2025-05-07T20:26:41.1205878Z 2025-05-07T20:26:41.1205881Z 2025-05-07T20:26:41.1205885Z 2025-05-07T20:26:41.1206015Z  2025-05-07T20:26:41.1206247Z 2025-05-07T20:26:41.1206252Z 2025-05-07T20:26:41.1206258Z 2025-05-07T20:26:41.1206263Z 2025-05-07T20:26:41.1206268Z 2025-05-07T20:26:41.1206282Z 2025-05-07T20:26:41.1206288Z 2025-05-07T20:26:41.1206293Z 2025-05-07T20:26:41.1206298Z 2025-05-07T20:26:41.1206303Z 2025-05-07T20:26:41.1206309Z 2025-05-07T20:26:41.1206313Z 2025-05-07T20:26:41.1206462Z  2025-05-07T20:26:41.1206697Z 2025-05-07T20:26:41.1206702Z 2025-05-07T20:26:41.1206707Z 2025-05-07T20:26:41.1206713Z 2025-05-07T20:26:41.1206718Z 2025-05-07T20:26:41.1206723Z 2025-05-07T20:26:41.1206728Z 2025-05-07T20:26:41.1206733Z 2025-05-07T20:26:41.1206739Z 2025-05-07T20:26:41.1206744Z 2025-05-07T20:26:41.1206749Z 2025-05-07T20:26:41.1206754Z 2025-05-07T20:26:41.1206759Z 2025-05-07T20:26:41.1206977Z  2025-05-07T20:26:41.1207230Z 2025-05-07T20:26:41.1207235Z 2025-05-07T20:26:41.1207240Z 2025-05-07T20:26:41.1207246Z 2025-05-07T20:26:41.1207250Z 2025-05-07T20:26:41.1207256Z 2025-05-07T20:26:41.1207261Z 2025-05-07T20:26:41.1207266Z 2025-05-07T20:26:41.1207280Z 2025-05-07T20:26:41.1207286Z 2025-05-07T20:26:41.1207296Z 2025-05-07T20:26:41.1207302Z 2025-05-07T20:26:41.1207307Z 2025-05-07T20:26:41.1207312Z 2025-05-07T20:26:41.1207516Z  2025-05-07T20:26:41.1207847Z 2025-05-07T20:26:41.1207853Z 2025-05-07T20:26:41.1207857Z 
2025-05-07T20:26:41.1207862Z 2025-05-07T20:26:41.1207868Z 2025-05-07T20:26:41.1207873Z 2025-05-07T20:26:41.1207878Z 2025-05-07T20:26:41.1207883Z 2025-05-07T20:26:41.1207888Z 2025-05-07T20:26:41.1207894Z 2025-05-07T20:26:41.1207899Z 2025-05-07T20:26:41.1207904Z 2025-05-07T20:26:41.1207909Z 2025-05-07T20:26:41.1207915Z 2025-05-07T20:26:41.1207921Z 2025-05-07T20:26:41.1208148Z  2025-05-07T20:26:41.1208413Z 2025-05-07T20:26:41.1208418Z 2025-05-07T20:26:41.1208423Z 2025-05-07T20:26:41.1208429Z 2025-05-07T20:26:41.1208434Z 2025-05-07T20:26:41.1208439Z 2025-05-07T20:26:41.1208444Z 2025-05-07T20:26:41.1208450Z 2025-05-07T20:26:41.1208455Z 2025-05-07T20:26:41.1208460Z 2025-05-07T20:26:41.1208473Z 2025-05-07T20:26:41.1208478Z 2025-05-07T20:26:41.1208492Z 2025-05-07T20:26:41.1208497Z 2025-05-07T20:26:41.1208502Z 2025-05-07T20:26:41.1208507Z 2025-05-07T20:26:41.1208722Z  2025-05-07T20:26:41.1208994Z 2025-05-07T20:26:41.1208999Z 2025-05-07T20:26:41.1209005Z 2025-05-07T20:26:41.1209018Z 2025-05-07T20:26:41.1209023Z 2025-05-07T20:26:41.1209028Z 2025-05-07T20:26:41.1209034Z 2025-05-07T20:26:41.1209039Z 2025-05-07T20:26:41.1209044Z 2025-05-07T20:26:41.1209050Z 2025-05-07T20:26:41.1209055Z 2025-05-07T20:26:41.1209060Z 2025-05-07T20:26:41.1209065Z 2025-05-07T20:26:41.1209070Z 2025-05-07T20:26:41.1209081Z 2025-05-07T20:26:41.1209087Z 2025-05-07T20:26:41.1209092Z 2025-05-07T20:26:41.1209316Z  2025-05-07T20:26:41.1209601Z 2025-05-07T20:26:41.1209606Z 2025-05-07T20:26:41.1209611Z 2025-05-07T20:26:41.1209617Z 2025-05-07T20:26:41.1209622Z 2025-05-07T20:26:41.1209627Z 2025-05-07T20:26:41.1209777Z 2025-05-07T20:26:41.1209780Z 2025-05-07T20:26:41.1209784Z 2025-05-07T20:26:41.1209787Z 2025-05-07T20:26:41.1209791Z 2025-05-07T20:26:41.1209794Z 2025-05-07T20:26:41.1209798Z 2025-05-07T20:26:41.1209801Z 2025-05-07T20:26:41.1209805Z 2025-05-07T20:26:41.1209816Z 2025-05-07T20:26:41.1209820Z 2025-05-07T20:26:41.1209823Z 2025-05-07T20:26:41.1210006Z  2025-05-07T20:26:41.1210233Z 2025-05-07T20:26:41.1210239Z 2025-05-07T20:26:41.1210388Z  2025-05-07T20:26:41.1210493Z 2025-05-07T20:26:41.1210497Z 2025-05-07T20:26:41.1210639Z  2025-05-07T20:26:41.1210773Z 2025-05-07T20:26:41.1210910Z 2025-05-07T20:26:41.1210915Z 2025-05-07T20:26:41.1211055Z  2025-05-07T20:26:41.1211199Z 2025-05-07T20:26:41.1211202Z 2025-05-07T20:26:41.1211206Z 2025-05-07T20:26:41.1211210Z 2025-05-07T20:26:41.1211359Z  2025-05-07T20:26:41.1211508Z 2025-05-07T20:26:41.1211512Z 2025-05-07T20:26:41.1211516Z 2025-05-07T20:26:41.1211519Z 2025-05-07T20:26:41.1211529Z 2025-05-07T20:26:41.1211639Z  2025-05-07T20:26:41.1211773Z 2025-05-07T20:26:41.1211777Z 2025-05-07T20:26:41.1211781Z 2025-05-07T20:26:41.1211784Z 2025-05-07T20:26:41.1211788Z 2025-05-07T20:26:41.1211792Z 2025-05-07T20:26:41.1211906Z  2025-05-07T20:26:41.1212043Z 2025-05-07T20:26:41.1212049Z 2025-05-07T20:26:41.1212054Z 2025-05-07T20:26:41.1212059Z 2025-05-07T20:26:41.1212064Z 2025-05-07T20:26:41.1212070Z 2025-05-07T20:26:41.1212075Z 2025-05-07T20:26:41.1212240Z  2025-05-07T20:26:41.1212387Z 2025-05-07T20:26:41.1212390Z 2025-05-07T20:26:41.1212394Z 2025-05-07T20:26:41.1212404Z 2025-05-07T20:26:41.1212408Z 2025-05-07T20:26:41.1212412Z 2025-05-07T20:26:41.1212418Z 2025-05-07T20:26:41.1212423Z 2025-05-07T20:26:41.1212599Z  2025-05-07T20:26:41.1212762Z 2025-05-07T20:26:41.1212766Z 2025-05-07T20:26:41.1212769Z 2025-05-07T20:26:41.1212773Z 2025-05-07T20:26:41.1212777Z 2025-05-07T20:26:41.1212786Z 2025-05-07T20:26:41.1212789Z 2025-05-07T20:26:41.1212793Z 2025-05-07T20:26:41.1212797Z 2025-05-07T20:26:41.1212976Z  
2025-05-07T20:26:41.1213145Z 2025-05-07T20:26:41.1213148Z 2025-05-07T20:26:41.1213154Z 2025-05-07T20:26:41.1213163Z 2025-05-07T20:26:41.1213177Z 2025-05-07T20:26:41.1213182Z 2025-05-07T20:26:41.1213188Z 2025-05-07T20:26:41.1213193Z 2025-05-07T20:26:41.1213198Z 2025-05-07T20:26:41.1213203Z 2025-05-07T20:26:41.1213385Z  2025-05-07T20:26:41.1213558Z 2025-05-07T20:26:41.1213564Z 2025-05-07T20:26:41.1213569Z 2025-05-07T20:26:41.1213574Z 2025-05-07T20:26:41.1213587Z 2025-05-07T20:26:41.1213592Z 2025-05-07T20:26:41.1213597Z 2025-05-07T20:26:41.1213603Z 2025-05-07T20:26:41.1213608Z 2025-05-07T20:26:41.1213613Z 2025-05-07T20:26:41.1213618Z 2025-05-07T20:26:41.1213830Z  2025-05-07T20:26:41.1214068Z 2025-05-07T20:26:41.1214073Z 2025-05-07T20:26:41.1214078Z 2025-05-07T20:26:41.1214090Z 2025-05-07T20:26:41.1214095Z 2025-05-07T20:26:41.1214100Z 2025-05-07T20:26:41.1214105Z 2025-05-07T20:26:41.1214118Z 2025-05-07T20:26:41.1214124Z 2025-05-07T20:26:41.1214129Z 2025-05-07T20:26:41.1214134Z 2025-05-07T20:26:41.1214139Z 2025-05-07T20:26:41.1214335Z  2025-05-07T20:26:41.1214584Z 2025-05-07T20:26:41.1214590Z 2025-05-07T20:26:41.1214595Z 2025-05-07T20:26:41.1214609Z 2025-05-07T20:26:41.1214614Z 2025-05-07T20:26:41.1214619Z 2025-05-07T20:26:41.1214625Z 2025-05-07T20:26:41.1214630Z 2025-05-07T20:26:41.1214635Z 2025-05-07T20:26:41.1214640Z 2025-05-07T20:26:41.1214646Z 2025-05-07T20:26:41.1214670Z 2025-05-07T20:26:41.1214675Z 2025-05-07T20:26:41.1214864Z  2025-05-07T20:26:41.1215125Z 2025-05-07T20:26:41.1215130Z 2025-05-07T20:26:41.1215135Z 2025-05-07T20:26:41.1215141Z 2025-05-07T20:26:41.1215146Z 2025-05-07T20:26:41.1215152Z 2025-05-07T20:26:41.1215157Z 2025-05-07T20:26:41.1215163Z 2025-05-07T20:26:41.1215288Z 2025-05-07T20:26:41.1215293Z 2025-05-07T20:26:41.1215298Z 2025-05-07T20:26:41.1215304Z 2025-05-07T20:26:41.1215309Z 2025-05-07T20:26:41.1215314Z 2025-05-07T20:26:41.1215525Z  2025-05-07T20:26:41.1215783Z 2025-05-07T20:26:41.1215789Z 2025-05-07T20:26:41.1215794Z 2025-05-07T20:26:41.1215799Z 2025-05-07T20:26:41.1215804Z 2025-05-07T20:26:41.1215810Z 2025-05-07T20:26:41.1215815Z 2025-05-07T20:26:41.1215820Z 2025-05-07T20:26:41.1215825Z 2025-05-07T20:26:41.1215830Z 2025-05-07T20:26:41.1215846Z 2025-05-07T20:26:41.1215851Z 2025-05-07T20:26:41.1215856Z 2025-05-07T20:26:41.1215943Z 2025-05-07T20:26:41.1215949Z 2025-05-07T20:26:41.1216165Z  2025-05-07T20:26:41.1216407Z 2025-05-07T20:26:41.1216410Z 2025-05-07T20:26:41.1216414Z 2025-05-07T20:26:41.1216418Z 2025-05-07T20:26:41.1216422Z 2025-05-07T20:26:41.1216425Z 2025-05-07T20:26:41.1216429Z 2025-05-07T20:26:41.1216432Z 2025-05-07T20:26:41.1216443Z 2025-05-07T20:26:41.1216446Z 2025-05-07T20:26:41.1216450Z 2025-05-07T20:26:41.1216454Z 2025-05-07T20:26:41.1216457Z 2025-05-07T20:26:41.1216461Z 2025-05-07T20:26:41.1216465Z 2025-05-07T20:26:41.1216468Z 2025-05-07T20:26:41.1216625Z  2025-05-07T20:26:41.1216838Z 2025-05-07T20:26:41.1216843Z 2025-05-07T20:26:41.1216848Z 2025-05-07T20:26:41.1216853Z 2025-05-07T20:26:41.1216859Z 2025-05-07T20:26:41.1216864Z 2025-05-07T20:26:41.1216869Z 2025-05-07T20:26:41.1216874Z 2025-05-07T20:26:41.1216879Z 2025-05-07T20:26:41.1216885Z 2025-05-07T20:26:41.1216890Z 2025-05-07T20:26:41.1216902Z 2025-05-07T20:26:41.1216907Z 2025-05-07T20:26:41.1216921Z 2025-05-07T20:26:41.1216927Z 2025-05-07T20:26:41.1216932Z 2025-05-07T20:26:41.1216938Z 2025-05-07T20:26:41.1217148Z  2025-05-07T20:26:41.1217415Z 2025-05-07T20:26:41.1217420Z 2025-05-07T20:26:41.1217425Z 2025-05-07T20:26:41.1217431Z 2025-05-07T20:26:41.1217454Z 2025-05-07T20:26:41.1217459Z 
2025-05-07T20:26:41.1217464Z 2025-05-07T20:26:41.1217469Z 2025-05-07T20:26:41.1217475Z 2025-05-07T20:26:41.1217480Z 2025-05-07T20:26:41.1217485Z 2025-05-07T20:26:41.1217490Z 2025-05-07T20:26:41.1217495Z 2025-05-07T20:26:41.1217501Z 2025-05-07T20:26:41.1217506Z 2025-05-07T20:26:41.1217511Z 2025-05-07T20:26:41.1217516Z 2025-05-07T20:26:41.1217522Z 2025-05-07T20:26:41.1217750Z  2025-05-07T20:26:41.1218042Z 2025-05-07T20:26:41.1218047Z 2025-05-07T20:26:41.1218194Z  2025-05-07T20:26:41.1218336Z 2025-05-07T20:26:41.1218341Z 2025-05-07T20:26:41.1218489Z  2025-05-07T20:26:41.1218630Z 2025-05-07T20:26:41.1218644Z 2025-05-07T20:26:41.1218649Z 2025-05-07T20:26:41.1218794Z  2025-05-07T20:26:41.1218937Z 2025-05-07T20:26:41.1218942Z 2025-05-07T20:26:41.1218948Z 2025-05-07T20:26:41.1218953Z 2025-05-07T20:26:41.1219104Z  2025-05-07T20:26:41.1219259Z 2025-05-07T20:26:41.1219270Z 2025-05-07T20:26:41.1219275Z 2025-05-07T20:26:41.1219280Z 2025-05-07T20:26:41.1219284Z 2025-05-07T20:26:41.1219437Z  2025-05-07T20:26:41.1219600Z 2025-05-07T20:26:41.1219605Z 2025-05-07T20:26:41.1219610Z 2025-05-07T20:26:41.1219616Z 2025-05-07T20:26:41.1219621Z 2025-05-07T20:26:41.1219626Z 2025-05-07T20:26:41.1219780Z  2025-05-07T20:26:41.1219948Z 2025-05-07T20:26:41.1219954Z 2025-05-07T20:26:41.1219959Z 2025-05-07T20:26:41.1219964Z 2025-05-07T20:26:41.1219969Z 2025-05-07T20:26:41.1219974Z 2025-05-07T20:26:41.1219980Z 2025-05-07T20:26:41.1220142Z  2025-05-07T20:26:41.1220323Z 2025-05-07T20:26:41.1220327Z 2025-05-07T20:26:41.1220330Z 2025-05-07T20:26:41.1220334Z 2025-05-07T20:26:41.1220338Z 2025-05-07T20:26:41.1220341Z 2025-05-07T20:26:41.1220345Z 2025-05-07T20:26:41.1220349Z 2025-05-07T20:26:41.1220506Z  2025-05-07T20:26:41.1220699Z 2025-05-07T20:26:41.1220703Z 2025-05-07T20:26:41.1220707Z 2025-05-07T20:26:41.1220814Z 2025-05-07T20:26:41.1220818Z 2025-05-07T20:26:41.1220822Z 2025-05-07T20:26:41.1220825Z 2025-05-07T20:26:41.1220829Z 2025-05-07T20:26:41.1220833Z 2025-05-07T20:26:41.1220990Z  2025-05-07T20:26:41.1221202Z 2025-05-07T20:26:41.1221208Z 2025-05-07T20:26:41.1221213Z 2025-05-07T20:26:41.1221218Z 2025-05-07T20:26:41.1221223Z 2025-05-07T20:26:41.1221229Z 2025-05-07T20:26:41.1221234Z 2025-05-07T20:26:41.1221240Z 2025-05-07T20:26:41.1221245Z 2025-05-07T20:26:41.1221259Z 2025-05-07T20:26:41.1221391Z  2025-05-07T20:26:41.1221552Z 2025-05-07T20:26:41.1221556Z 2025-05-07T20:26:41.1221643Z 2025-05-07T20:26:41.1221648Z 2025-05-07T20:26:41.1221651Z 2025-05-07T20:26:41.1221655Z 2025-05-07T20:26:41.1221665Z 2025-05-07T20:26:41.1221669Z 2025-05-07T20:26:41.1221672Z 2025-05-07T20:26:41.1221676Z 2025-05-07T20:26:41.1221680Z 2025-05-07T20:26:41.1221811Z  2025-05-07T20:26:41.1221981Z 2025-05-07T20:26:41.1221990Z 2025-05-07T20:26:41.1221994Z 2025-05-07T20:26:41.1222006Z 2025-05-07T20:26:41.1222010Z 2025-05-07T20:26:41.1222013Z 2025-05-07T20:26:41.1222017Z 2025-05-07T20:26:41.1222021Z 2025-05-07T20:26:41.1222024Z 2025-05-07T20:26:41.1222028Z 2025-05-07T20:26:41.1222032Z 2025-05-07T20:26:41.1222035Z 2025-05-07T20:26:41.1222165Z  2025-05-07T20:26:41.1222343Z 2025-05-07T20:26:41.1222347Z 2025-05-07T20:26:41.1222351Z 2025-05-07T20:26:41.1222354Z 2025-05-07T20:26:41.1222358Z 2025-05-07T20:26:41.1222362Z 2025-05-07T20:26:41.1222365Z 2025-05-07T20:26:41.1222369Z 2025-05-07T20:26:41.1222373Z 2025-05-07T20:26:41.1222380Z 2025-05-07T20:26:41.1222383Z 2025-05-07T20:26:41.1222387Z 2025-05-07T20:26:41.1222391Z 2025-05-07T20:26:41.1224047Z  done 2025-05-07T20:26:41.4336279Z Preparing transaction: \ | / done 2025-05-07T20:26:46.2787628Z Verifying 
transaction: \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ | / done 2025-05-07T20:26:47.4876430Z Executing transaction: \ | / - \ | / - \ | / - done 2025-05-07T20:26:50.0920926Z [INSTALL] Fixing file placements for CUDA 12.8.0+ ... 2025-05-07T20:26:50.0921333Z [INSTALL] Creating symlinks: libnvToolsExt.so 2025-05-07T20:26:50.0922022Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so 2025-05-07T20:26:50.0922592Z 2025-05-07T20:26:50.0935553Z 2025-05-07T20:26:50.0936569Z + ln -sf /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so.1 /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so 2025-05-07T20:26:50.0937273Z 2025-05-07T20:26:50.0950084Z 2025-05-07T20:26:50.0950252Z [INSTALL] Copying nvtx3 headers ... 2025-05-07T20:26:50.0955762Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/include/ 2025-05-07T20:26:50.0959685Z 2025-05-07T20:26:50.2528709Z 2025-05-07T20:26:50.2534350Z + cp -r /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCuda.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtCudaRt.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtOpenCL.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvToolsExtSync.h /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtx3.hpp /home/ec2-user/miniconda/envs/build_binary/nsight-compute-2025.1.0/host/target-linux-x64/nvtx/include/nvtx3/nvtxDetail /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/ 2025-05-07T20:26:50.2538353Z 2025-05-07T20:26:50.2558447Z 2025-05-07T20:26:50.2558719Z [INSTALL] Appending libcuda.so path to LD_LIBRARY_PATH ... 2025-05-07T20:26:50.2925607Z [ENV] Appending to LD_LIBRARY_PATH: /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs ... 2025-05-07T20:26:52.1700074Z ERROR conda.cli.main_run:execute(125): `conda run printenv LD_LIBRARY_PATH` failed. 
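[NOTE] Commentary added here, not part of the original log: `printenv LD_LIBRARY_PATH` exits non-zero when the variable is unset, so the ERROR above is the expected result of probing a fresh environment rather than a real failure; the script then persists the value with `conda env config vars set`, which re-exports it on every activation of build_binary. A minimal sketch of the same probe-then-persist pattern, reusing the env name and stubs path from this log:

  # probe the variable; if unset, persist it on the conda env
  conda run -n build_binary printenv LD_LIBRARY_PATH \
    || conda env config vars set -n build_binary \
         LD_LIBRARY_PATH=/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs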
2025-05-07T20:26:52.2324758Z + conda env config vars set -n build_binary LD_LIBRARY_PATH=/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs
2025-05-07T20:26:52.6729597Z [INSTALL] Setting environment variable NVML_LIB_PATH ...
2025-05-07T20:26:52.7084184Z + conda env config vars set -n build_binary NVML_LIB_PATH=/home/ec2-user/miniconda/envs/build_binary/lib/stubs/libnvidia-ml.so
2025-05-07T20:26:53.1426801Z [INSTALL] Setting environment variable CUDA_INCLUDE_DIRS ...
2025-05-07T20:26:53.1428132Z + conda env config vars set -n build_binary CUDA_INCLUDE_DIRS="/home/ec2-user/miniconda/envs/build_binary/include/:/home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/"
2025-05-07T20:26:55.5824173Z [CHECK] cuda_runtime.h found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/include/cuda_runtime.h
2025-05-07T20:26:57.6025821Z [CHECK] libcuda.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libcuda.so
2025-05-07T20:26:59.6412783Z [CHECK] libnvToolsExt.so found in CONDA_PREFIX PATH (symbolic link): /home/ec2-user/miniconda/envs/build_binary/lib/libnvToolsExt.so
2025-05-07T20:26:59.6413878Z /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/libnvToolsExt.so
2025-05-07T20:27:01.6804840Z [CHECK] libnvidia-ml.so found in CONDA_PREFIX PATH (file): /home/ec2-user/miniconda/envs/build_binary/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
2025-05-07T20:27:03.5731284Z /home/ec2-user/miniconda/envs/build_binary/bin/nvcc
2025-05-07T20:27:03.6353850Z [CHECK] Binary nvcc found in PATH
2025-05-07T20:27:07.4942675Z /tmp/tmpy1v2dtjh: line 3: clang: command not found
2025-05-07T20:27:07.4943638Z ERROR conda.cli.main_run:execute(125): `conda run clang --version` failed. (See above for error)
2025-05-07T20:27:07.5585891Z + ls -la /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d
2025-05-07T20:27:07.5606095Z total 36
2025-05-07T20:27:07.5606378Z drwxr-xr-x. 2 ec2-user ec2-user 191 May 7 20:26 .
2025-05-07T20:27:07.5606772Z drwxr-xr-x. 5 ec2-user ec2-user 62 May 7 20:25 ..
2025-05-07T20:27:07.5607215Z -rw-r--r--. 2 ec2-user ec2-user 3778 Jun 10 2024 activate-binutils_linux-64.sh
2025-05-07T20:27:07.5608715Z -rw-r--r--. 2 ec2-user ec2-user 11630 Jun 10 2024 activate-gcc_linux-64.sh
2025-05-07T20:27:07.5609183Z -rw-r--r--. 2 ec2-user ec2-user 5190 Jun 10 2024 activate-gxx_linux-64.sh
2025-05-07T20:27:07.5609994Z -rw-r--r--. 2 ec2-user ec2-user 136 Mar 27 01:27 libglib_activate.sh
2025-05-07T20:27:07.5610422Z -rw-r--r--. 2 ec2-user ec2-user 872 Nov 13 09:20 libxml2_activate.sh
2025-05-07T20:27:07.5610870Z -rw-r--r--. 2 ec2-user ec2-user 2932 Jan 24 22:22 ~cuda-nvcc_activate.sh
2025-05-07T20:27:07.5611368Z [INSTALL] Removing the -ccbin=CXX hook from NVCC activation scripts ...
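[NOTE] Commentary, not original log output: `~cuda-nvcc_activate.sh` in the listing above is conda's cuda-nvcc activation hook, which (in the typical packaging, an assumption here) injects a `-ccbin=${CXX}` flag so that nvcc always uses conda's gcc as the host compiler. Since this job builds with clang, the step below deletes every line containing `-ccbin=`; a quick way to inspect what the sed will remove first:

  # hypothetical pre-check; the path matches the ls output above
  grep -n -- '-ccbin=' /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d/*cuda-nvcc_activate.sh
  # then apply the same deletion the log performs next
  sed -i '/-ccbin=/d' /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d/*cuda-nvcc_activate.sh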
2025-05-07T20:27:07.5612006Z + sed -i /-ccbin=/d /home/ec2-user/miniconda/envs/build_binary/etc/conda/activate.d/*cuda-nvcc_activate.sh 2025-05-07T20:27:07.5612425Z 2025-05-07T20:27:07.5631469Z 2025-05-07T20:27:07.5632160Z + conda run -n build_binary c++ --version | grep -i clang 2025-05-07T20:27:07.5632420Z 2025-05-07T20:27:09.5173404Z 2025-05-07T20:27:09.5174002Z [BUILD] Setting prepend flags for NVCC ... 2025-05-07T20:27:09.5174529Z + conda env config vars set -n build_binary NVCC_PREPEND_FLAGS="-allow-unsupported-compiler" 2025-05-07T20:27:09.5174903Z 2025-05-07T20:27:09.9426205Z 2025-05-07T20:27:09.9426566Z + conda run -n build_binary printenv NVCC_PREPEND_FLAGS 2025-05-07T20:27:09.9426824Z 2025-05-07T20:27:11.8330008Z -allow-unsupported-compiler 2025-05-07T20:27:11.8330329Z 2025-05-07T20:27:11.8955613Z 2025-05-07T20:27:11.8955979Z [INFO] Printing out all preprocessor defines in nvcc ... 2025-05-07T20:27:11.8956776Z + conda run -n build_binary nvcc --compiler-options -dM -E -x cu - < /dev/null 2025-05-07T20:27:11.8957212Z 2025-05-07T20:27:13.8437533Z #define _GLIBCXX_DEPRECATED_SUGGEST(ALT) __attribute__ ((__deprecated__ ("use '" ALT "' instead"))) 2025-05-07T20:27:13.8438239Z #define M_PIl 3.141592653589793238462643383279502884L 2025-05-07T20:27:13.8438572Z #define _IO_CURRENTLY_PUTTING 0x800 2025-05-07T20:27:13.8438889Z #define __W_EXITCODE(ret,sig) ((ret) << 8 | (sig)) 2025-05-07T20:27:13.8439214Z #define __DBL_MIN_EXP__ (-1021) 2025-05-07T20:27:13.8439554Z #define _STL_PAIR_H 1 2025-05-07T20:27:13.8439884Z #define __cpp_attributes 200809L 2025-05-07T20:27:13.8441415Z #define __cpp_nontype_template_parameter_auto 201606L 2025-05-07T20:27:13.8441886Z #define __DELETE_THROW throw() 2025-05-07T20:27:13.8442226Z #define _PTRDIFF_T_ 2025-05-07T20:27:13.8442581Z #define M_PI_4 0.78539816339744830962 2025-05-07T20:27:13.8442983Z #define __UINT_LEAST16_MAX__ 0xffff 2025-05-07T20:27:13.8443358Z #define _IO_LEFT 02 2025-05-07T20:27:13.8443666Z #define __ATOMIC_ACQUIRE 2 2025-05-07T20:27:13.8444027Z #define _POSIX2_BC_SCALE_MAX 99 2025-05-07T20:27:13.8444414Z #define _GLIBCXX_USE_RANDOM_TR1 1 2025-05-07T20:27:13.8445050Z #define _GLIBCXX_MOVE_BACKWARD3(_Tp,_Up,_Vp) std::move_backward(_Tp, _Up, _Vp) 2025-05-07T20:27:13.8445625Z #define __FLT128_MAX_10_EXP__ 4932 2025-05-07T20:27:13.8445994Z #define RE_DUP_MAX (0x7fff) 2025-05-07T20:27:13.8446231Z #define _IOS_OUTPUT 2 2025-05-07T20:27:13.8446456Z #define __SM_100_RT_HPP__ 2025-05-07T20:27:13.8446755Z #define __FLT_MIN__ 1.17549435082228750796873653722224568e-38F 2025-05-07T20:27:13.8447190Z #define toascii_l(c,l) __toascii_l ((c), (l)) 2025-05-07T20:27:13.8447627Z #define __GCC_IEC_559_COMPLEX 2 2025-05-07T20:27:13.8447930Z #define _GLIBCXX_USE_FCHMOD 1 2025-05-07T20:27:13.8448269Z #define __cpp_aggregate_nsdmi 201304L 2025-05-07T20:27:13.8458601Z #define __bswap_16(x) (__extension__ ({ unsigned short int __v, __x = (unsigned short int) (x); if (__builtin_constant_p (__x)) __v = __bswap_constant_16 (__x); else __asm__ ("rorw $8, %w0" : "=r" (__v) : "0" (__x) : "cc"); __v; })) 2025-05-07T20:27:13.8459652Z #define __UINT_LEAST8_TYPE__ unsigned char 2025-05-07T20:27:13.8460117Z #define __SIZEOF_FLOAT80__ 16 2025-05-07T20:27:13.8460524Z #define cudaTextureTypeCubemapLayered 0xFC 2025-05-07T20:27:13.8460965Z #define _T_WCHAR_ 2025-05-07T20:27:13.8461272Z #define stdout stdout 2025-05-07T20:27:13.8461714Z #define _GLIBCXX_ABI_TAG_CXX11 __attribute ((__abi_tag__ ("cxx11"))) 2025-05-07T20:27:13.8462221Z #define CHAR_BIT __CHAR_BIT__ 
2025-05-07T20:27:13.8462563Z #define __flexarr [] 2025-05-07T20:27:13.8462896Z #define _GLIBCXX_HAVE_FINITEF 1 2025-05-07T20:27:13.8464299Z nvcc warning : Support for offline compilation for architectures prior to '<compute/sm/lto>_75' will be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning). 2025-05-07T20:27:13.8465541Z #define __islower_l(c,l) __isctype_l((c), _ISlower, (l)) 2025-05-07T20:27:13.8465999Z #define _IO_FLAGS2_USER_WBUF 8 2025-05-07T20:27:13.8466292Z #define _MATH_H 1 2025-05-07T20:27:13.8466567Z #define cudaOccupancyDisableCachingOverride 0x01 2025-05-07T20:27:13.8466896Z #define __S64_TYPE long int 2025-05-07T20:27:13.8467133Z #define __stub_fchflags 2025-05-07T20:27:13.8467700Z #define cudaDeviceScheduleMask 0x07 2025-05-07T20:27:13.8468062Z #define __SQUAD_TYPE long int 2025-05-07T20:27:13.8468313Z #define __INTMAX_C(c) c ## L 2025-05-07T20:27:13.8468607Z #define cudaStreamFireAndForget ((cudaStream_t)0x4) 2025-05-07T20:27:13.8468934Z #define _BSD_SIZE_T_DEFINED_ 2025-05-07T20:27:13.8469185Z #define NL_NMAX INT_MAX 2025-05-07T20:27:13.8469413Z #define _BITS_TIME_H 1 2025-05-07T20:27:13.8469690Z #define M_LN10l 2.302585092994045684017991454684364208L 2025-05-07T20:27:13.8470003Z #define _GLIBCXX_TXN_SAFE_DYN 2025-05-07T20:27:13.8470306Z #define cudaStreamTailLaunch ((cudaStream_t)0x3) 2025-05-07T20:27:13.8470650Z #define M_El 2.718281828459045235360287471352662498L 2025-05-07T20:27:13.8471031Z #define _PSTL_PRAGMA_DECLARE_SIMD _PSTL_PRAGMA(omp declare simd) 2025-05-07T20:27:13.8471394Z #define __CHAR_BIT__ 8 2025-05-07T20:27:13.8471649Z #define __FSWORD_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:27:13.8472029Z #define _PSTL_STRING_CONCAT(x,y) x #y 2025-05-07T20:27:13.8472332Z #define _GLIBCXX98_USE_C99_MATH 1 2025-05-07T20:27:13.8472594Z #define FP_NAN 0 2025-05-07T20:27:13.8472851Z #define makedev(maj,min) gnu_dev_makedev (maj, min) 2025-05-07T20:27:13.8473253Z #define cudaGetDeviceProperties cudaGetDeviceProperties_v2 2025-05-07T20:27:13.8473631Z #define __cudaCDP2GetErrorString 2025-05-07T20:27:13.8473907Z #define SHRT_MAX __SHRT_MAX__ 2025-05-07T20:27:13.8474162Z #define _GLIBCXX_X86_RDSEED 1 2025-05-07T20:27:13.8474408Z #define __SM_80_RT_H__ 2025-05-07T20:27:13.8474628Z #define _NEW 2025-05-07T20:27:13.8474842Z #define CLOCK_PROCESS_CPUTIME_ID 2 2025-05-07T20:27:13.8475111Z #define __UINT8_MAX__ 0xff 2025-05-07T20:27:13.8475468Z #define _PSTL_ASSERT_MSG(_Condition,_Message) __glibcxx_assert(_Condition) 2025-05-07T20:27:13.8475872Z #define __SCHAR_WIDTH__ 8 2025-05-07T20:27:13.8476096Z #define __USE_ANSI 1 2025-05-07T20:27:13.8476373Z #define _IO_BE(expr,res) __builtin_expect ((expr), res) 2025-05-07T20:27:13.8476758Z #define __isupper_l(c,l) __isctype_l((c), _ISupper, (l)) 2025-05-07T20:27:13.8477107Z #define __cudaCDP2Memcpy2DAsync_ptsz 2025-05-07T20:27:13.8477401Z #define __WINT_MAX__ 0xffffffffU 2025-05-07T20:27:13.8477671Z #define __SIZEOF_PTHREAD_ATTR_T 56 2025-05-07T20:27:13.8477939Z #define __FLT32_MIN_EXP__ (-125) 2025-05-07T20:27:13.8478210Z #define _GLIBCXX_END_NAMESPACE_LDBL 2025-05-07T20:27:13.8478488Z #define PIPE_BUF 4096 2025-05-07T20:27:13.8478802Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC_2ARGS(PRM1,PRM2) 2025-05-07T20:27:13.8479247Z #define _GLIBCXX_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_NAMESPACE_CXX11 2025-05-07T20:27:13.8479613Z #define ADJ_TICK 0x4000 2025-05-07T20:27:13.8479885Z #define _PSTL_VERSION_PATCH (_PSTL_VERSION % 10) 2025-05-07T20:27:13.8480201Z #define MQ_PRIO_MAX 32768
2025-05-07T20:27:13.8480446Z #define __SIZEOF_PTHREAD_MUTEXATTR_T 4 2025-05-07T20:27:13.8480755Z #define __WAIT_INT(status) (*(int *) &(status)) 2025-05-07T20:27:13.8481200Z #define __GLIBC_PREREQ(maj,min) ((__GLIBC__ << 16) + __GLIBC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:27:13.8481701Z #define cudaCooperativeLaunchMultiDeviceNoPreSync 0x01 2025-05-07T20:27:13.8482058Z #define _XOPEN_SOURCE 700 2025-05-07T20:27:13.8482306Z #define _POSIX2_BC_DIM_MAX 2048 2025-05-07T20:27:13.8482565Z #define __VECTOR_FUNCTIONS_HPP__ 2025-05-07T20:27:13.8482843Z #define __cpp_static_assert 201411L 2025-05-07T20:27:13.8483114Z #define __GLIBCXX__ 20230528 2025-05-07T20:27:13.8483477Z #define _GLIBCXX_HAVE_STRXFRM_L 1 2025-05-07T20:27:13.8483741Z #define _POSIX_TTY_NAME_MAX 9 2025-05-07T20:27:13.8484014Z #define _GLIBCXX_USE_WEAK_REF __GXX_WEAK__ 2025-05-07T20:27:13.8484307Z #define __OFF_T_MATCHES_OFF64_T 1 2025-05-07T20:27:13.8484568Z #define __ORDER_LITTLE_ENDIAN__ 1234 2025-05-07T20:27:13.8484858Z #define __SIZE_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:13.8485203Z #define __ispunct_l(c,l) __isctype_l((c), _ISpunct, (l)) 2025-05-07T20:27:13.8485525Z #define __WCHAR_MAX__ 0x7fffffff 2025-05-07T20:27:13.8485795Z #define _GLIBCXX_USE_CLOCK_MONOTONIC 1 2025-05-07T20:27:13.8486177Z #define __BLKCNT_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:27:13.8486518Z #define __isprint_l(c,l) __isctype_l((c), _ISprint, (l)) 2025-05-07T20:27:13.8486867Z #define cudaNvSciSyncAttrSignal 0x1 2025-05-07T20:27:13.8487147Z #define _GLIBCXX_USE_LONG_LONG 1 2025-05-07T20:27:13.8487424Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_1 1 2025-05-07T20:27:13.8487741Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_2 1 2025-05-07T20:27:13.8488060Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_4 1 2025-05-07T20:27:13.8488495Z #define __DBL_DENORM_MIN__ double(4.94065645841246544176568792868221372e-324L) 2025-05-07T20:27:13.8488887Z #define __GCC_HAVE_SYNC_COMPARE_AND_SWAP_8 1 2025-05-07T20:27:13.8489183Z #define ADJ_ESTERROR 0x0008 2025-05-07T20:27:13.8489442Z #define __GCC_ATOMIC_CHAR_LOCK_FREE 2 2025-05-07T20:27:13.8489703Z #define __GCC_IEC_559 2 2025-05-07T20:27:13.8489984Z #define __cpp_lib_transformation_trait_aliases 201304 2025-05-07T20:27:13.8490310Z #define _IO_flockfile(_fp) 2025-05-07T20:27:13.8490553Z #define CLOCK_MONOTONIC_RAW 4 2025-05-07T20:27:13.8490820Z #define __FLT32X_DECIMAL_DIG__ 17 2025-05-07T20:27:13.8491074Z #define _IOFBF 0 2025-05-07T20:27:13.8491270Z #define __USE_BSD 1 2025-05-07T20:27:13.8491488Z #define __FLT_EVAL_METHOD__ 0 2025-05-07T20:27:13.8491746Z #define SHRT_MIN (-SHRT_MAX - 1) 2025-05-07T20:27:13.8491999Z #define _IO_USER_LOCK 0x8000 2025-05-07T20:27:13.8492240Z #define _IO_NO_WRITES 8 2025-05-07T20:27:13.8492494Z #define _GLIBCXX_PSEUDO_VISIBILITY(V) 2025-05-07T20:27:13.8492836Z #define __ASMNAME2(prefix,cname) __STRING (prefix) cname 2025-05-07T20:27:13.8493177Z #define _GLIBCXX_HAVE_SYS_STAT_H 1 2025-05-07T20:27:13.8493473Z #define MB_CUR_MAX (__ctype_get_mb_cur_max ()) 2025-05-07T20:27:13.8493780Z #define __cpp_binary_literals 201304L 2025-05-07T20:27:13.8494057Z #define _CPP_TYPE_TRAITS_H 1 2025-05-07T20:27:13.8494314Z #define __BEGIN_NAMESPACE_C99 2025-05-07T20:27:13.8494572Z #define __FLT64_DECIMAL_DIG__ 17 2025-05-07T20:27:13.8494869Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(A) 2025-05-07T20:27:13.8495247Z #define _G_HAVE_ST_BLKSIZE defined (_STATBUF_ST_BLKSIZE) 2025-05-07T20:27:13.8495600Z #define __cpp_noexcept_function_type 201510L 2025-05-07T20:27:13.8495890Z #define M_PI 
3.14159265358979323846 2025-05-07T20:27:13.8496193Z #define _GLIBCXX_PACKAGE_NAME "package-unused" 2025-05-07T20:27:13.8496507Z #define _GLIBCXX_HAVE_BUILTIN_IS_SAME 1 2025-05-07T20:27:13.8496811Z #define __GCC_ATOMIC_CHAR32_T_LOCK_FREE 2 2025-05-07T20:27:13.8497098Z #define _POSIX_DELAYTIMER_MAX 32 2025-05-07T20:27:13.8497364Z #define _GLIBCXX_USE_UTIME 1 2025-05-07T20:27:13.8497622Z #define _STL_ITERATOR_BASE_FUNCS_H 1 2025-05-07T20:27:13.8498204Z #define _IO_peekc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) && __underflow (_fp) == EOF ? EOF : *(unsigned char *) (_fp)->_IO_read_ptr) 2025-05-07T20:27:13.8498773Z #define _GLIBCXX_TR1_ELL_INTEGRAL_TCC 1 2025-05-07T20:27:13.8499090Z #define w_termsig __wait_terminated.__w_termsig 2025-05-07T20:27:13.8499391Z #define __FLOAT_WORD_ORDER __BYTE_ORDER 2025-05-07T20:27:13.8499687Z #define __cudaCDP2GetErrorName 2025-05-07T20:27:13.8499960Z #define XATTR_SIZE_MAX 65536 2025-05-07T20:27:13.8500298Z #define be64toh(x) __bswap_64 (x) 2025-05-07T20:27:13.8500734Z #define __ASSERT_VOID_CAST static_cast<void> 2025-05-07T20:27:13.8501192Z #define __cpp_variadic_templates 200704L 2025-05-07T20:27:13.8501506Z #define RAND_MAX 2147483647 2025-05-07T20:27:13.8501861Z #define _GLIBCXX_USE_C99_COMPLEX_TR1 1 2025-05-07T20:27:13.8502176Z #define __UINT_FAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:13.8502476Z #define __SM_90_RT_H__ 2025-05-07T20:27:13.8502704Z #define __SIG_ATOMIC_TYPE__ int 2025-05-07T20:27:13.8502953Z #define __COMPAR_FN_T 2025-05-07T20:27:13.8503184Z #define __GID_T_TYPE __U32_TYPE 2025-05-07T20:27:13.8503431Z #define _IO_BAD_SEEN 0x4000 2025-05-07T20:27:13.8503892Z #define _PSTL_PRAGMA_MESSAGE_IMPL(x) _PSTL_PRAGMA(message(_PSTL_STRING_CONCAT(_PSTL_PRAGMA_LOCATION, x))) 2025-05-07T20:27:13.8504393Z #define __DBL_MIN_10_EXP__ (-307) 2025-05-07T20:27:13.8504825Z #define __glibcxx_requires_sorted_pred(_First,_Last,_Pred) 2025-05-07T20:27:13.8505170Z #define __FINITE_MATH_ONLY__ 0 2025-05-07T20:27:13.8505461Z #define _PSTL_PRAGMA_SIMD_INCLUSIVE_SCAN(PRM) 2025-05-07T20:27:13.8505780Z #define cudaArrayColorAttachment 0x20 2025-05-07T20:27:13.8506074Z #define __cpp_variable_templates 201304L 2025-05-07T20:27:13.8506581Z #define cudaKernelNodeAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:27:13.8507122Z #define __cpp_lib_integral_constant_callable 201304 2025-05-07T20:27:13.8507433Z #define _GLIBCXX_HAVE_SINHF 1 2025-05-07T20:27:13.8507815Z #define MOD_TIMECONST ADJ_TIMECONST 2025-05-07T20:27:13.8508103Z #define __cpp_lib_result_of_sfinae 201210 2025-05-07T20:27:13.8508418Z #define __SM_30_INTRINSICS_H__ 2025-05-07T20:27:13.8508702Z #define __FLT32X_MAX_EXP__ 1024 2025-05-07T20:27:13.8508961Z #define _GLIBCXX_USE_WCHAR_T 1 2025-05-07T20:27:13.8509213Z #define _GLIBCXX_MATH_H 1 2025-05-07T20:27:13.8509442Z #define __u_char_defined 2025-05-07T20:27:13.8509754Z #define WIFEXITED(status) __WIFEXITED (__WAIT_INT (status)) 2025-05-07T20:27:13.8510104Z #define STA_PPSERROR 0x0800 2025-05-07T20:27:13.8510344Z #define _GLIBCXX_STD_A std 2025-05-07T20:27:13.8510590Z #define __FLT32_HAS_DENORM__ 1 2025-05-07T20:27:13.8510857Z #define _GLIBCXX_BEGIN_NAMESPACE_VERSION 2025-05-07T20:27:13.8511274Z #define __device_builtin_texture_type__ __location__(device_builtin_texture_type) 2025-05-07T20:27:13.8511690Z #define FP_INFINITE 1 2025-05-07T20:27:13.8512046Z #define _GLIBCXX11_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:27:13.8512444Z #define _IO_pid_t __pid_t
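The be64toh(x) __bswap_64 (x) entry above reflects a little-endian host: converting a big-endian value to host order requires reversing all eight bytes. A small sketch of an equivalent swap in portable shift form (the hand-rolled bswap64 is illustrative; glibc's __bswap_64 is the actual implementation behind the macro):

#include <cstdint>
#include <cstdio>
// Hand-rolled 64-bit byte swap: on a little-endian host, be64toh must
// mirror the byte order end to end, which is what __bswap_64 does.
static std::uint64_t bswap64(std::uint64_t x) {
    std::uint64_t r = 0;
    for (int i = 0; i < 8; ++i)
        r = (r << 8) | ((x >> (8 * i)) & 0xff);  // move byte i to the mirror slot
    return r;
}
int main() {
    std::printf("%016llx\n",
                static_cast<unsigned long long>(bswap64(0x0102030405060708ULL)));
    // prints 0807060504030201
    return 0;
}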
2025-05-07T20:27:13.8512693Z #define __UINT_FAST8_MAX__ 0xff 2025-05-07T20:27:13.8512950Z #define __LEAF , __leaf__ 2025-05-07T20:27:13.8513191Z #define PATH_MAX 4096 2025-05-07T20:27:13.8513428Z #define __cpp_rvalue_reference 200610L 2025-05-07T20:27:13.8513754Z #define __LDBL_REDIR1(name,proto,alias) name proto 2025-05-07T20:27:13.8514062Z #define _LIMITS_H___ 2025-05-07T20:27:13.8514275Z #define __size_t 2025-05-07T20:27:13.8514499Z #define _GLIBCXX_HAVE_FREXPF 1 2025-05-07T20:27:13.8515028Z #define STA_RONLY (STA_PPSSIGNAL | STA_PPSJITTER | STA_PPSWANDER | STA_PPSERROR | STA_CLOCKERR | STA_NANO | STA_MODE | STA_CLK) 2025-05-07T20:27:13.8515569Z #define _GLIBCXX_HAVE_FREXPL 1 2025-05-07T20:27:13.8515888Z #define __cpp_nested_namespace_definitions 201411L 2025-05-07T20:27:13.8516210Z #define __DEC64_MAX_EXP__ 385 2025-05-07T20:27:13.8516464Z #define _WCHAR_T_DEFINED 2025-05-07T20:27:13.8516812Z #define __glibcxx_requires_can_decrement_range(_First1,_Last1,_First2) 2025-05-07T20:27:13.8517211Z #define MOD_STATUS ADJ_STATUS 2025-05-07T20:27:13.8517496Z #define _GLIBCXX_PURE __attribute__ ((__pure__)) 2025-05-07T20:27:13.8517810Z #define _GLIBCXX_HAVE_STDINT_H 1 2025-05-07T20:27:13.8518106Z #define __SIZEOF_PTHREAD_CONDATTR_T 4 2025-05-07T20:27:13.8518408Z #define __INT8_C(c) c 2025-05-07T20:27:13.8518659Z #define __cudaCDP2GetParameterBuffer 2025-05-07T20:27:13.8518953Z #define _GLIBCXX_HAVE_COSHF 1 2025-05-07T20:27:13.8519201Z #define _GLIBCXX_HAVE_COSHL 1 2025-05-07T20:27:13.8519464Z #define __SM_70_RT_HPP__ 2025-05-07T20:27:13.8519711Z #define __INT_LEAST8_WIDTH__ 8 2025-05-07T20:27:13.8519968Z #define __cpp_variadic_using 201611L 2025-05-07T20:27:13.8520283Z #define __UINT_LEAST64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:13.8520707Z #define __INT_LEAST8_MAX__ 0x7f 2025-05-07T20:27:13.8520972Z #define __SM_61_INTRINSICS_HPP__ 2025-05-07T20:27:13.8521237Z #define _IO_FLAGS2_MMAP 1 2025-05-07T20:27:13.8521489Z #define __cpp_capture_star_this 201603L 2025-05-07T20:27:13.8521795Z #define __cudaCDP2LaunchDeviceV2_ptsz 2025-05-07T20:27:13.8522087Z #define _GLIBCXX_HAVE_ENDIAN_H 1 2025-05-07T20:27:13.8522440Z #define __always_inline __inline __attribute__ ((__always_inline__)) 2025-05-07T20:27:13.8522811Z #define NFDBITS __NFDBITS 2025-05-07T20:27:13.8523055Z #define _PSTL_PRAGMA_FORCEINLINE 2025-05-07T20:27:13.8523337Z #define _GLIBCXX_HAVE_SYS_STATVFS_H 1 2025-05-07T20:27:13.8523764Z #define __glibcxx_requires_sorted(_First,_Last) 2025-05-07T20:27:13.8524072Z #define __SHRT_MAX__ 0x7fff 2025-05-07T20:27:13.8524326Z #define _GLIBCXX_SYMVER_GNU 1 2025-05-07T20:27:13.8524603Z #define w_stopval __wait_stopped.__w_stopval 2025-05-07T20:27:13.8524894Z #define STA_UNSYNC 0x0040 2025-05-07T20:27:13.8525197Z #define __LDBL_MAX__ 1.18973149535723176502126385303097021e+4932L 2025-05-07T20:27:13.8525609Z #define _GLIBCXX_USE_C99_COMPLEX _GLIBCXX11_USE_C99_COMPLEX 2025-05-07T20:27:13.8525959Z #define __FLT64X_MAX_10_EXP__ 4932 2025-05-07T20:27:13.8526243Z #define __cpp_if_constexpr 201606L 2025-05-07T20:27:13.8526618Z #define __glibcxx_class_requires4(_a,_b,_c,_d,_e) 2025-05-07T20:27:13.8526935Z #define _GLIBCXX_HAVE_WCHAR_H 1 2025-05-07T20:27:13.8527241Z #define _GLIBCXX_USE_C99_STDIO _GLIBCXX11_USE_C99_STDIO 2025-05-07T20:27:13.8527575Z #define __daddr_t_defined 2025-05-07T20:27:13.8527820Z #define __LDBL_IS_IEC_60559__ 2 2025-05-07T20:27:13.8528084Z #define _GLIBCXX_TR1_RIEMANN_ZETA_TCC 1 2025-05-07T20:27:13.8528407Z #define _GLIBCXX_HAVE_STRUCT_DIRENT_D_TYPE 1 2025-05-07T20:27:13.8528923Z #define 
_PSTL_CPP11_STD_ROTATE_BROKEN ((__GLIBCXX__ && __GLIBCXX__ < 20150716) || (_MSC_VER && _MSC_VER < 1800)) 2025-05-07T20:27:13.8529440Z #define _ACRTIMP 2025-05-07T20:27:13.8529657Z #define _IO_EOF_SEEN 0x10 2025-05-07T20:27:13.8529919Z #define _GLIBCXX_TR1_POLY_LAGUERRE_TCC 1 2025-05-07T20:27:13.8530207Z #define _IOS_BIN 128 2025-05-07T20:27:13.8530542Z #define __fortify_function __extern_always_inline __attribute_artificial__ 2025-05-07T20:27:13.8530941Z #define __FLT64X_HAS_QUIET_NAN__ 1 2025-05-07T20:27:13.8531201Z #define UNDERFLOW 4 2025-05-07T20:27:13.8531414Z #define NAME_MAX 255 2025-05-07T20:27:13.8531648Z #define SCHAR_MAX __SCHAR_MAX__ 2025-05-07T20:27:13.8531919Z #define __UINT_LEAST8_MAX__ 0xff 2025-05-07T20:27:13.8532184Z #define __GCC_ATOMIC_BOOL_LOCK_FREE 2 2025-05-07T20:27:13.8532470Z #define _IO_UNIFIED_JUMPTABLES 1 2025-05-07T20:27:13.8532847Z #define __FLT128_DENORM_MIN__ 6.47517511943802511092443895822764655e-4966F128 2025-05-07T20:27:13.8533225Z #define __ptr_t void * 2025-05-07T20:27:13.8533457Z #define M_E 2.7182818284590452354 2025-05-07T20:27:13.8533728Z #define cudaSurfaceType1D 0x01 2025-05-07T20:27:13.8533988Z #define __USE_ISOCXX11 1 2025-05-07T20:27:13.8534242Z #define __UINTMAX_TYPE__ long unsigned int 2025-05-07T20:27:13.8534551Z #define cudaDeviceBlockingSync 0x04 2025-05-07T20:27:13.8534838Z #define CLOCK_MONOTONIC_COARSE 6 2025-05-07T20:27:13.8535097Z #define _GLIBCXX_OS_DEFINES 1 2025-05-07T20:27:13.8535376Z #define _GLIBCXX_NODISCARD [[__nodiscard__]] 2025-05-07T20:27:13.8535680Z #define cudaSurfaceType2D 0x02 2025-05-07T20:27:13.8535921Z #define __linux 1 2025-05-07T20:27:13.8536139Z #define __DEC32_EPSILON__ 1E-6DF 2025-05-07T20:27:13.8536407Z #define cudaDeviceMask 0xff 2025-05-07T20:27:13.8536658Z #define _GLIBCXX_END_NAMESPACE_ALGO 2025-05-07T20:27:13.8536940Z #define __CUDA_API_VER_MAJOR__ 12 2025-05-07T20:27:13.8537213Z #define htobe16(x) __bswap_16 (x) 2025-05-07T20:27:13.8537488Z #define HUGE_VALF (__builtin_huge_valf()) 2025-05-07T20:27:13.8537783Z #define __FLT_EVAL_METHOD_TS_18661_3__ 0 2025-05-07T20:27:13.8538073Z #define HUGE_VALL (__builtin_huge_vall()) 2025-05-07T20:27:13.8538352Z #define _BITS_TYPES_H 1 2025-05-07T20:27:13.8538624Z #define ULONG_LONG_MAX (LONG_LONG_MAX * 2ULL + 1ULL) 2025-05-07T20:27:13.8539069Z #define _IO_cleanup_region_end(_Doit) 2025-05-07T20:27:13.8539357Z #define cudaSurfaceType3D 0x03 2025-05-07T20:27:13.8539617Z #define _GLIBCXX_HAVE_SYS_TIME_H 1 2025-05-07T20:27:13.8539896Z #define __cudaGet_blockIdx() blockIdx 2025-05-07T20:27:13.8540547Z #define _IO_DONT_CLOSE 0100000 2025-05-07T20:27:13.8541494Z #define __MATHDECLX(type,function,suffix,args,attrib) __MATHDECL_1(type, function,suffix, args) __attribute__ (attrib); __MATHDECL_1(type, __CONCAT(__,function),suffix, args) __attribute__ (attrib) 2025-05-07T20:27:13.8542294Z #define cudaHostRegisterDefault 0x00 2025-05-07T20:27:13.8542573Z #define __unix 1 2025-05-07T20:27:13.8543024Z #define MATH_ERRNO 1 2025-05-07T20:27:13.8543256Z #define _GLIBCXX_STDIO_SEEK_END 2 2025-05-07T20:27:13.8543525Z #define _GLIBCXX_USE_FCHMODAT 1 2025-05-07T20:27:13.8543781Z #define __SM_100_RT_H__ 2025-05-07T20:27:13.8544022Z #define __UINT32_MAX__ 0xffffffffU 2025-05-07T20:27:13.8544303Z #define __GXX_EXPERIMENTAL_CXX0X__ 1 2025-05-07T20:27:13.8544589Z #define __UID_T_TYPE __U32_TYPE 2025-05-07T20:27:13.8544865Z #define _GLIBCXX20_DEPRECATED(MSG) 2025-05-07T20:27:13.8545163Z #define _GLIBCXX_HAVE_ATOMIC_LOCK_POLICY 1 2025-05-07T20:27:13.8545633Z #define __CUDART_API_VERSION 
((__CUDA_API_VER_MAJOR__ * 1000) + (__CUDA_API_VER_MINOR__ * 10)) 2025-05-07T20:27:13.8546091Z #define __nv_pure__ __location__(nv_pure) 2025-05-07T20:27:13.8546397Z #define CUDARTAPI_CDECL 2025-05-07T20:27:13.8546652Z #define _PSTL_USAGE_WARNINGS 0 2025-05-07T20:27:13.8546926Z #define _GLIBCXX98_USE_C99_COMPLEX 1 2025-05-07T20:27:13.8547204Z #define __cpp_lib_void_t 201411 2025-05-07T20:27:13.8547465Z #define _POSIX_AIO_MAX 1 2025-05-07T20:27:13.8547791Z #define __SIZE_T 2025-05-07T20:27:13.8548032Z #define isgraph_l(c,l) __isgraph_l ((c), (l)) 2025-05-07T20:27:13.8548346Z #define _GLIBCXX_FULLY_DYNAMIC_STRING 0 2025-05-07T20:27:13.8548639Z #define _POSIX_PIPE_BUF 512 2025-05-07T20:27:13.8548890Z #define __CUDA_RUNTIME_API_H__ 2025-05-07T20:27:13.8549153Z #define _GLIBCXX_HAVE_STRTOLD 1 2025-05-07T20:27:13.8549415Z #define _ATFILE_SOURCE 1 2025-05-07T20:27:13.8549797Z #define __glibcxx_assert(cond) do { __glibcxx_constexpr_assert(cond); } while (false) 2025-05-07T20:27:13.8550236Z #define __WAIT_STATUS void * 2025-05-07T20:27:13.8550496Z #define __MATH_FUNCTIONS_H__ 2025-05-07T20:27:13.8550752Z #define _GLIBCXX_HAVE_WCSTOF 1 2025-05-07T20:27:13.8551018Z #define __FLT128_MIN_EXP__ (-16381) 2025-05-07T20:27:13.8551293Z #define _GLIBCXX_HAVE_LC_MESSAGES 1 2025-05-07T20:27:13.8551555Z #define __WINT_MIN__ 0U 2025-05-07T20:27:13.8552120Z #define _PSTL_CPP14_VARIABLE_TEMPLATES_PRESENT (!__INTEL_COMPILER || __INTEL_COMPILER >= 1700) && (_MSC_FULL_VER >= 190023918 || __cplusplus >= 201402L) 2025-05-07T20:27:13.8552751Z #define isdigit_l(c,l) __isdigit_l ((c), (l)) 2025-05-07T20:27:13.8553041Z #define WUNTRACED 2 2025-05-07T20:27:13.8553260Z #define _GLIBCXX_HAVE_SQRTF 1 2025-05-07T20:27:13.8553529Z #define __SIZEOF_PTHREAD_RWLOCKATTR_T 8 2025-05-07T20:27:13.8553806Z #define NZERO 20 2025-05-07T20:27:13.8554027Z #define _GLIBCXX_HAVE_MEMALIGN 1 2025-05-07T20:27:13.8554301Z #define _PSTL_PRAGMA(x) _Pragma(#x) 2025-05-07T20:27:13.8554589Z #define MOD_CLKA ADJ_OFFSET_SINGLESHOT 2025-05-07T20:27:13.8554862Z #define MOD_CLKB ADJ_TICK 2025-05-07T20:27:13.8555111Z #define __FLT128_MIN_10_EXP__ (-4931) 2025-05-07T20:27:13.8555389Z #define __FLT32X_IS_IEC_60559__ 2 2025-05-07T20:27:13.8555661Z #define __DEVICE_FUNCTIONS_H__ 2025-05-07T20:27:13.8555926Z #define SCHAR_MIN (-SCHAR_MAX - 1) 2025-05-07T20:27:13.8556189Z #define EXIT_FAILURE 1 2025-05-07T20:27:13.8556423Z #define ADJ_MAXERROR 0x0004 2025-05-07T20:27:13.8556678Z #define __INT_LEAST16_WIDTH__ 16 2025-05-07T20:27:13.8556939Z #define _SIZE_T_DEFINED_ 2025-05-07T20:27:13.8557182Z #define _POSIX_AIO_LISTIO_MAX 2 2025-05-07T20:27:13.8557450Z #define __cudaCDP2DeviceGetLimit 2025-05-07T20:27:13.8557779Z #define __LDBL_REDIR_NTH(name,proto) name proto __THROW 2025-05-07T20:27:13.8558137Z #define __cudaCDP2FuncGetAttributes 2025-05-07T20:27:13.8558416Z #define __SCHAR_MAX__ 0x7f 2025-05-07T20:27:13.8558893Z #define __FLT128_MANT_DIG__ 113 2025-05-07T20:27:13.8559158Z #define __USING_NAMESPACE_STD(name) 2025-05-07T20:27:13.8559442Z #define _GLIBCXX_HAVE_OBSOLETE_ISINF 1 2025-05-07T20:27:13.8559841Z #define __WCHAR_MIN__ (-__WCHAR_MAX__ - 1) 2025-05-07T20:27:13.8560151Z #define SEEK_DATA 3 2025-05-07T20:27:13.8560379Z #define __KERNEL_STRICT_NAMES 2025-05-07T20:27:13.8560664Z #define _IO_stderr ((_IO_FILE*)(&_IO_2_1_stderr_)) 2025-05-07T20:27:13.8561081Z #define _IO_ferror_unlocked(__fp) (((__fp)->_flags & _IO_ERR_SEEN) != 0) 2025-05-07T20:27:13.8561465Z #define _FUNCTEXCEPT_H 1 2025-05-07T20:27:13.8561846Z #define __INT64_C(c) c ## L 
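The __CUDART_API_VERSION definition that opens this block packs the runtime version as (major * 1000) + (minor * 10). Assuming the minor digit matches the __CUDACC_VER_MINOR__ 8 reported later in this dump, CUDA 12.8 packs to 12080; a quick sketch of the arithmetic with the values hard-coded for illustration:

#include <cstdio>
// Same packing as the dumped __CUDART_API_VERSION macro:
// (major * 1000) + (minor * 10). The 12 comes from this log's
// __CUDA_API_VER_MAJOR__; the 8 is assumed from __CUDACC_VER_MINOR__.
int main() {
    const int major = 12, minor = 8;
    std::printf("%d\n", (major * 1000) + (minor * 10));  // prints 12080
    return 0;
}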
2025-05-07T20:27:13.8562114Z #define __NTH(fct) __LEAF_ATTR fct throw () 2025-05-07T20:27:13.8562446Z #define _GLIBCXX_CONST __attribute__ ((__const__)) 2025-05-07T20:27:13.8562756Z #define _GLIBCXX_HAVE_LINK 1 2025-05-07T20:27:13.8563031Z #define cudaNvSciSyncAttrWait 0x2 2025-05-07T20:27:13.8563321Z #define __GCC_ATOMIC_POINTER_LOCK_FREE 2 2025-05-07T20:27:13.8563608Z #define STA_PPSWANDER 0x0400 2025-05-07T20:27:13.8563860Z #define __INT_WCHAR_T_H 2025-05-07T20:27:13.8564093Z #define WSTOPPED 2 2025-05-07T20:27:13.8564315Z #define _POSIX_THREAD_THREADS_MAX 64 2025-05-07T20:27:13.8564597Z #define _POSIX_MQ_OPEN_MAX 8 2025-05-07T20:27:13.8564847Z #define FP_NORMAL 4 2025-05-07T20:27:13.8565079Z #define __cudaCDP2LaunchDevice_ptsz 2025-05-07T20:27:13.8565345Z #define _BITS_TIMEX_H 1 2025-05-07T20:27:13.8565573Z #define _POSIX_LINK_MAX 8 2025-05-07T20:27:13.8565819Z #define _GLIBCXX_HAVE_LIMIT_FSIZE 1 2025-05-07T20:27:13.8566085Z #define _GLIBCXX_HAVE_ATAN2F 1 2025-05-07T20:27:13.8566357Z #define cudaTextureType1D 0x01 2025-05-07T20:27:13.8566618Z #define _GLIBCXX_HAVE_ATAN2L 1 2025-05-07T20:27:13.8566869Z #define COLL_WEIGHTS_MAX 255 2025-05-07T20:27:13.8567133Z #define __isascii(c) (((c) & ~0x7f) == 0) 2025-05-07T20:27:13.8567425Z #define __toascii(c) ((c) & 0x7f) 2025-05-07T20:27:13.8567837Z #define __attribute_format_strfmon__(a,b) __attribute__ ((__format__ (__strfmon__, a, b))) 2025-05-07T20:27:13.8568284Z #define _IO_MAGIC 0xFBAD0000 2025-05-07T20:27:13.8568540Z #define _GLIBCXX_USE_SENDFILE 1 2025-05-07T20:27:13.8568794Z #define _POSIX_SOURCE 1 2025-05-07T20:27:13.8569033Z #define cudaTextureType2D 0x02 2025-05-07T20:27:13.8569290Z #define _PTR_TRAITS_H 1 2025-05-07T20:27:13.8569553Z #define _GLIBCXX_NOEXCEPT_QUAL noexcept (_NE) 2025-05-07T20:27:13.8569853Z #define _GLIBCXX_HAVE_POWF 1 2025-05-07T20:27:13.8570111Z #define _POSIX2_BC_STRING_MAX 1000 2025-05-07T20:27:13.8570423Z #define __attribute_used__ __attribute__ ((__used__)) 2025-05-07T20:27:13.8570744Z #define cudaTextureType3D 0x03 2025-05-07T20:27:13.8571004Z #define _STDIO_USES_IOSTREAM 2025-05-07T20:27:13.8571255Z #define CLOCK_REALTIME 0 2025-05-07T20:27:13.8571490Z #define __FLT32X_MANT_DIG__ 53 2025-05-07T20:27:13.8571753Z #define __GCC_ATOMIC_CHAR16_T_LOCK_FREE 2 2025-05-07T20:27:13.8572046Z #define __cpp_aligned_new 201606L 2025-05-07T20:27:13.8572307Z #define __USER_LABEL_PREFIX__ 2025-05-07T20:27:13.8572580Z #define cudaEventBlockingSync 0x01 2025-05-07T20:27:13.8572861Z #define _GLIBCXX_HAVE_TANL 1 2025-05-07T20:27:13.8573118Z #define _GLIBCXX_USE_PTHREAD_RWLOCK_T 1 2025-05-07T20:27:13.8573425Z #define _GLIBCXX_HAVE_LINUX_RANDOM_H 1 2025-05-07T20:27:13.8573711Z #define _GLIBCXX_USE_C99_FENV_TR1 1 2025-05-07T20:27:13.8573984Z #define __FLT32_MAX_10_EXP__ 38 2025-05-07T20:27:13.8574224Z #define __GLIBC__ 2 2025-05-07T20:27:13.8574433Z #define __END_DECLS } 2025-05-07T20:27:13.8574664Z #define FP_ILOGB0 (-2147483647 - 1) 2025-05-07T20:27:13.8575021Z #define __FLT64X_EPSILON__ 1.08420217248550443400745280086994171e-19F64x 2025-05-07T20:27:13.8575393Z #define __CONCAT(x,y) x ## y 2025-05-07T20:27:13.8631097Z #define WCONTINUED 8 2025-05-07T20:27:13.8631525Z #define __STDC_HOSTED__ 1 2025-05-07T20:27:13.8631872Z #define _GLIBCXX_HAVE_ARPA_INET_H 1 2025-05-07T20:27:13.8632236Z #define _ALLOCA_H 1 2025-05-07T20:27:13.8632533Z #define __host__ __location__(host) 2025-05-07T20:27:13.8634051Z #define __warndecl(name,msg) extern void name (void) __attribute__((__warning__ (msg))) 2025-05-07T20:27:13.8634630Z #define 
__SLONG32_TYPE int 2025-05-07T20:27:13.8634976Z #define _GLIBCXX_DEBUG_ASSERTIONS_H 1 2025-05-07T20:27:13.8635364Z #define _SYS_SELECT_H 1 2025-05-07T20:27:13.8635684Z #define _IO_LINE_BUF 0x200 2025-05-07T20:27:13.8636019Z #define _IOS_NOCREATE 32 2025-05-07T20:27:13.8636347Z #define __DEC64_MIN_EXP__ (-382) 2025-05-07T20:27:13.8636728Z #define __cudaGet_warpSize() warpSize 2025-05-07T20:27:13.8637122Z #define __SSIZE_T_TYPE __SWORD_TYPE 2025-05-07T20:27:13.8637501Z #define _GLIBCXX_HAVE_LIMIT_VMEM 0 2025-05-07T20:27:13.8638024Z #define __global__ __location__(global) 2025-05-07T20:27:13.8638466Z #define __GNU_LIBRARY__ 6 2025-05-07T20:27:13.8638804Z #define __cpp_decltype_auto 201304L 2025-05-07T20:27:13.8639174Z #define __DBL_DIG__ 15 2025-05-07T20:27:13.8639477Z #define TIME_UTC 1 2025-05-07T20:27:13.8639766Z #define __FLT32_DIG__ 6 2025-05-07T20:27:13.8640493Z #define __forceinline__ __inline__ __attribute__((always_inline)) 2025-05-07T20:27:13.8641036Z #define cudaHostAllocWriteCombined 0x04 2025-05-07T20:27:13.8641392Z #define cudaDeviceScheduleAuto 0x00 2025-05-07T20:27:13.8641697Z #define iscntrl_l(c,l) __iscntrl_l ((c), (l)) 2025-05-07T20:27:13.8641985Z #define _G_BUFSIZ 8192 2025-05-07T20:27:13.8642280Z #define __FLT_EPSILON__ 1.19209289550781250000000000000000000e-7F 2025-05-07T20:27:13.8642633Z #define cudaTextureTypeCubemap 0x0C 2025-05-07T20:27:13.8642926Z #define __cudaCDP2GetDevice 2025-05-07T20:27:13.8643196Z #define __cudaCDP2PeekAtLastError 2025-05-07T20:27:13.8643472Z #define STA_CLOCKERR 0x1000 2025-05-07T20:27:13.8643713Z #define __GXX_WEAK__ 1 2025-05-07T20:27:13.8643961Z #define __RLIM_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:27:13.8644255Z #define _GLIBCXX_HAVE_ISNANF 1 2025-05-07T20:27:13.8644508Z #define __SHRT_WIDTH__ 16 2025-05-07T20:27:13.8644797Z #define __cpp_lib_robust_nonmodifying_seq_ops 201304 2025-05-07T20:27:13.8645120Z #define _GLIBCXX_BITS_SPECFUN_H 1 2025-05-07T20:27:13.8645397Z #define _GLIBCXX_HAVE_ISNANL 1 2025-05-07T20:27:13.8645679Z #define isblank_l(c,l) __isblank_l ((c), (l)) 2025-05-07T20:27:13.8645965Z #define _G_config_h 1 2025-05-07T20:27:13.8646238Z #define M_LOG2El 1.442695040888963407359924681001892137L 2025-05-07T20:27:13.8646584Z #define ADJ_OFFSET_SINGLESHOT 0x8001 2025-05-07T20:27:13.8646953Z #define _GCC_WCHAR_T 2025-05-07T20:27:13.8647170Z #define TMP_MAX 238328 2025-05-07T20:27:13.8647407Z #define __FLT32_IS_IEC_60559__ 2 2025-05-07T20:27:13.8647675Z #define __DEVICE_TYPES_H__ 2025-05-07T20:27:13.8647921Z #define __DEV_T_TYPE __UQUAD_TYPE 2025-05-07T20:27:13.8648199Z #define _EXT_NUMERIC_TRAITS 1 2025-05-07T20:27:13.8648467Z #define _GLIBCXX_BEGIN_NAMESPACE_ALGO 2025-05-07T20:27:13.8648738Z #define _IO_SKIPWS 01 2025-05-07T20:27:13.8649154Z #define cudaStreamGraphFireAndForgetAsSibling (cudaStream_t)0x0300000000000000 2025-05-07T20:27:13.8649602Z #define _IO_SCIENTIFIC 04000 2025-05-07T20:27:13.8649963Z #define _GLIBCXX_HAVE_STRING_H 1 2025-05-07T20:27:13.8650446Z #define __LDBL_MIN__ 3.36210314311209350626267781732175260e-4932L 2025-05-07T20:27:13.8650899Z #define cudaDeviceScheduleSpin 0x01 2025-05-07T20:27:13.8651263Z #define __nonnull(params) __attribute__ ((__nonnull__ params)) 2025-05-07T20:27:13.8651615Z #define __DBL_IS_IEC_60559__ 2 2025-05-07T20:27:13.8651861Z #define le32toh(x) (x) 2025-05-07T20:27:13.8652087Z #define _SIZE_T_DEFINED 2025-05-07T20:27:13.8652324Z #define _GLIBCXX_HAVE_XLOCALE_H 1 2025-05-07T20:27:13.8652654Z #define cudaArraySparsePropertiesSingleMipTail 0x1 2025-05-07T20:27:13.8653000Z #define 
__DEC32_MAX__ 9.999999E96DF 2025-05-07T20:27:13.8653389Z #define __WIFSIGNALED(status) (((signed char) (((status) & 0x7f) + 1) >> 1) > 0) 2025-05-07T20:27:13.8653791Z #define _GLIBCXX_HAVE_FMODL 1 2025-05-07T20:27:13.8654053Z #define _GLIBCXX_HAVE_POLL 1 2025-05-07T20:27:13.8654304Z #define __SM_32_INTRINSICS_H__ 2025-05-07T20:27:13.8654563Z #define _POSIX_NAME_MAX 14 2025-05-07T20:27:13.8655095Z #define __cpp_threadsafe_static_init 200806L 2025-05-07T20:27:13.8655601Z #define _GLIBCXX_MAKE_MOVE_IF_NOEXCEPT_ITERATOR(_Iter) std::__make_move_if_noexcept_iterator(_Iter) 2025-05-07T20:27:13.8656079Z #define _GLIBCXX_USE_CLOCK_REALTIME 1 2025-05-07T20:27:13.8656379Z #define __cpp_enumerator_attributes 201411L 2025-05-07T20:27:13.8656722Z #define __WCOREDUMP(status) ((status) & __WCOREFLAG) 2025-05-07T20:27:13.8657021Z #define _WCHAR_T_ 2025-05-07T20:27:13.8657239Z #define _GLIBCXX_FAST_MATH 0 2025-05-07T20:27:13.8657603Z #define __FLT64X_DENORM_MIN__ 3.64519953188247460252840593361941982e-4951F64x 2025-05-07T20:27:13.8658121Z #define RTSIG_MAX 32 2025-05-07T20:27:13.8658340Z #define _STDDEF_H 2025-05-07T20:27:13.8658565Z #define CU_UUID_HAS_BEEN_DEFINED 2025-05-07T20:27:13.8658824Z #define _VA_LIST_DEFINED 2025-05-07T20:27:13.8659073Z #define __FLT32X_HAS_INFINITY__ 1 2025-05-07T20:27:13.8659396Z #define __glibcxx_requires_non_empty_range(_First,_Last) 2025-05-07T20:27:13.8659777Z #define __grid_constant__ __location__(grid_constant) 2025-05-07T20:27:13.8660101Z #define __INT32_MAX__ 0x7fffffff 2025-05-07T20:27:13.8660385Z #define _GLIBCXX_BEGIN_EXTERN_C extern "C" { 2025-05-07T20:27:13.8660832Z #define _PSTL_CPP14_INTEGER_SEQUENCE_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L) 2025-05-07T20:27:13.8661339Z #define __glibcxx_digits_b(T,B) (B - __glibcxx_signed_b (T,B)) 2025-05-07T20:27:13.8661691Z #define __SIZEOF_PTHREAD_COND_T 48 2025-05-07T20:27:13.8661998Z #define _PSTL_PRAGMA_SIMD_ORDERED_MONOTONIC(PRM) 2025-05-07T20:27:13.8662291Z #define __unix__ 1 2025-05-07T20:27:13.8662515Z #define __SM_60_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:27:13.8662789Z #define __INT_WIDTH__ 32 2025-05-07T20:27:13.8663016Z #define __SIZEOF_LONG__ 8 2025-05-07T20:27:13.8663244Z #define _IONBF 2 2025-05-07T20:27:13.8663674Z #define __MATHCALLX(function,suffix,args,attrib) __MATHDECLX (_Mdouble_,function,suffix, args, attrib) 2025-05-07T20:27:13.8664416Z #define _IO_getc_unlocked(_fp) (_IO_BE ((_fp)->_IO_read_ptr >= (_fp)->_IO_read_end, 0) ? 
__uflow (_fp) : *(unsigned char *) (_fp)->_IO_read_ptr++) 2025-05-07T20:27:13.8664943Z #define __STDC_IEC_559__ 1 2025-05-07T20:27:13.8665189Z #define __STDC_ISO_10646__ 201103L 2025-05-07T20:27:13.8665445Z #define __UINT16_C(c) c 2025-05-07T20:27:13.8665672Z #define M_2_PI 0.63661977236758134308 2025-05-07T20:27:13.8665929Z #define STA_DEL 0x0020 2025-05-07T20:27:13.8666163Z #define __CUDACC_VER_MINOR__ 8 2025-05-07T20:27:13.8666403Z #define __id_t_defined 2025-05-07T20:27:13.8666663Z #define w_retcode __wait_terminated.__w_retcode 2025-05-07T20:27:13.8667106Z #define _IO_PENDING_OUTPUT_COUNT(_fp) ((_fp)->_IO_write_ptr - (_fp)->_IO_write_base) 2025-05-07T20:27:13.8667509Z #define _GLIBCXX_HAVE_MODFF 1 2025-05-07T20:27:13.8667873Z #define _GLIBCXX_HAVE_MODFL 1 2025-05-07T20:27:13.8668121Z #define __DECIMAL_DIG__ 21 2025-05-07T20:27:13.8668369Z #define _POSIX2_RE_DUP_MAX 255 2025-05-07T20:27:13.8668614Z #define __USE_FORTIFY_LEVEL 0 2025-05-07T20:27:13.8668868Z #define __STDC_IEC_559_COMPLEX__ 1 2025-05-07T20:27:13.8669124Z #define SING 2 2025-05-07T20:27:13.8669325Z #define STA_FREQHOLD 0x0080 2025-05-07T20:27:13.8669580Z #define __SM_32_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:27:13.8669866Z #define cudaStreamDefault 0x00 2025-05-07T20:27:13.8670200Z #define __FLT64_EPSILON__ 2.22044604925031308084726333618164062e-16F64 2025-05-07T20:27:13.8670570Z #define _GLIBCXX_HAVE_HYPOTL 1 2025-05-07T20:27:13.8670827Z #define _GLIBCXX_HAVE_SYS_UIO_H 1 2025-05-07T20:27:13.8671078Z #define __gnu_linux__ 1 2025-05-07T20:27:13.8671306Z #define __INT16_MAX__ 0x7fff 2025-05-07T20:27:13.8671556Z #define _LARGEFILE_SOURCE 1 2025-05-07T20:27:13.8671859Z #define MAX_INPUT 255 2025-05-07T20:27:13.8672101Z #define __FLT64_MIN_EXP__ (-1021) 2025-05-07T20:27:13.8672416Z #define __isalpha_l(c,l) __isctype_l((c), _ISalpha, (l)) 2025-05-07T20:27:13.8672772Z #define __glibcxx_requires_heap(_First,_Last) 2025-05-07T20:27:13.8673079Z #define _GLIBCXX_CPU_DEFINES 1 2025-05-07T20:27:13.8673339Z #define _GLIBCXX_HAVE_POLL_H 1 2025-05-07T20:27:13.8673836Z #define __attribute_warn_unused_result__ __attribute__ ((__warn_unused_result__)) 2025-05-07T20:27:13.8674236Z #define _IO_SHOWPOS 02000 2025-05-07T20:27:13.8674556Z #define _GLIBCXX_HAVE_SYMVER_SYMBOL_RENAMING_RUNTIME_SUPPORT 1 2025-05-07T20:27:13.8674906Z #define _Mfloat_ float 2025-05-07T20:27:13.8675155Z #define __glibcxx_requires_cond(_Cond,_Msg) 2025-05-07T20:27:13.8675456Z #define __FLT64X_MIN_10_EXP__ (-4931) 2025-05-07T20:27:13.8675739Z #define DELAYTIMER_MAX 2147483647 2025-05-07T20:27:13.8676051Z #define cudaMemPoolCreateUsageHwDecompress 0x2 2025-05-07T20:27:13.8676665Z #define __glibcxx_max_b(T,B) (__glibcxx_signed_b (T,B) ? 
(((((T)1 << (__glibcxx_digits_b (T,B) - 1)) - 1) << 1) + 1) : ~(T)0) 2025-05-07T20:27:13.8677163Z #define __LDBL_HAS_QUIET_NAN__ 1 2025-05-07T20:27:13.8677430Z #define _GLIBCXX98_USE_C99_STDIO 1 2025-05-07T20:27:13.8677738Z #define cudaKernelNodeAttrID cudaLaunchAttributeID 2025-05-07T20:27:13.8678093Z #define __glibcxx_class_requires2(_a,_b,_c) 2025-05-07T20:27:13.8678382Z #define __USE_ISOC11 1 2025-05-07T20:27:13.8678597Z #define _BSD_SIZE_T_ 2025-05-07T20:27:13.8678816Z #define ADJ_MICRO 0x1000 2025-05-07T20:27:13.8679054Z #define _GLIBCXX_HAVE_FABSF 1 2025-05-07T20:27:13.8679296Z #define _GLIBCXX_HAVE_FABSL 1 2025-05-07T20:27:13.8679581Z #define _PSTL_PRAGMA_SIMD _PSTL_PRAGMA(omp simd) 2025-05-07T20:27:13.8679888Z #define __FLT64_MANT_DIG__ 53 2025-05-07T20:27:13.8680264Z #define __attribute_const__ __attribute__ ((__const__)) 2025-05-07T20:27:13.8680642Z #define __THROW throw () 2025-05-07T20:27:13.8680883Z #define __cudaGet_gridDim() gridDim 2025-05-07T20:27:13.8681168Z #define __SM_60_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:27:13.8681507Z #define __glibcxx_requires_heap_pred(_First,_Last,_Pred) 2025-05-07T20:27:13.8681850Z #define htobe32(x) __bswap_32 (x) 2025-05-07T20:27:13.8682117Z #define _GLIBCXX_HAVE_POWL 1 2025-05-07T20:27:13.8682364Z #define __FLT64X_MANT_DIG__ 64 2025-05-07T20:27:13.8682626Z #define __GLIBC_HAVE_LONG_LONG 1 2025-05-07T20:27:13.8682878Z #define L_tmpnam 20 2025-05-07T20:27:13.8683086Z #define ___int_wchar_t_h 2025-05-07T20:27:13.8683417Z #define WIFCONTINUED(status) __WIFCONTINUED (__WAIT_INT (status)) 2025-05-07T20:27:13.8683795Z #define isascii(c) __isascii (c) 2025-05-07T20:27:13.8684039Z #define _T_PTRDIFF 2025-05-07T20:27:13.8684335Z #define _GLIBCXX_MOVE3(_Tp,_Up,_Vp) std::move(_Tp, _Up, _Vp) 2025-05-07T20:27:13.8684675Z #define toascii(c) __toascii (c) 2025-05-07T20:27:13.8684927Z #define __GNUC__ 11 2025-05-07T20:27:13.8685167Z #define __SYSCALL_ULONG_TYPE __ULONGWORD_TYPE 2025-05-07T20:27:13.8685455Z #define __GXX_RTTI 1 2025-05-07T20:27:13.8685673Z #define __pie__ 2 2025-05-07T20:27:13.8685868Z #define __MMX__ 1 2025-05-07T20:27:13.8686081Z #define __cudaCDP2Malloc 2025-05-07T20:27:13.8686328Z #define __timespec_defined 1 2025-05-07T20:27:13.8686566Z #define L_ctermid 9 2025-05-07T20:27:13.8686796Z #define __OFF64_T_TYPE __SQUAD_TYPE 2025-05-07T20:27:13.8687094Z #define __cudaCDP2GetParameterBufferV2 2025-05-07T20:27:13.8687478Z #define offsetof(TYPE,MEMBER) __builtin_offsetof (TYPE, MEMBER) 2025-05-07T20:27:13.8687837Z #define _BITS_POSIX2_LIM_H 1 2025-05-07T20:27:13.8688093Z #define _GLIBCXX98_USE_C99_STDLIB 1 2025-05-07T20:27:13.8688367Z #define cudaMemAttachGlobal 0x01 2025-05-07T20:27:13.8688660Z #define FD_SET(fd,fdsetp) __FD_SET (fd, fdsetp) 2025-05-07T20:27:13.8688971Z #define __FLT_HAS_DENORM__ 1 2025-05-07T20:27:13.8689225Z #define __SIZEOF_LONG_DOUBLE__ 16 2025-05-07T20:27:13.8689643Z #define _GLIBCXX_NATIVE_THREAD_ID (__gthread_active_p() ? __gthread_self() : (__gthread_t)1) 2025-05-07T20:27:13.8690374Z #define assert_perror(errnum) (!(errnum) ? 
__ASSERT_VOID_CAST (0) : __assert_perror_fail ((errnum), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:27:13.8690956Z #define _IO_HAVE_ST_BLKSIZE _G_HAVE_ST_BLKSIZE 2025-05-07T20:27:13.8691251Z #define __USE_SVID 1 2025-05-07T20:27:13.8691497Z #define __constant__ __location__(constant) 2025-05-07T20:27:13.8691799Z #define _GLIBCXX_HAVE_POSIX_MEMALIGN 1 2025-05-07T20:27:13.8692181Z #define __device__ __location__(device) 2025-05-07T20:27:13.8692501Z #define _GLIBCXX_HAVE_EXCEPTION_PTR_SINCE_GCC46 1 2025-05-07T20:27:13.8692809Z #define _GLIBCXX_RES_LIMITS 1 2025-05-07T20:27:13.8693057Z #define M_1_PI 0.31830988618379067154 2025-05-07T20:27:13.8693346Z #define CUDART_DEVICE __device__ 2025-05-07T20:27:13.8693687Z #define __LDBL_REDIR1_NTH(name,proto,alias) name proto __THROW 2025-05-07T20:27:13.8694043Z #define M_PI_2 1.57079632679489661923 2025-05-07T20:27:13.8694317Z #define __BIGGEST_ALIGNMENT__ 16 2025-05-07T20:27:13.8694672Z #define cudaExternalSemaphoreWaitSkipNvSciBufMemSync 0x02 2025-05-07T20:27:13.8695159Z #define __STDC_UTF_16__ 1 2025-05-07T20:27:13.8695399Z #define LONG_MAX __LONG_MAX__ 2025-05-07T20:27:13.8695754Z #define __glibcxx_digits10_b(T,B) (__glibcxx_digits_b (T,B) * 643L / 2136) 2025-05-07T20:27:13.8696162Z #define _POSIX_THREAD_DESTRUCTOR_ITERATIONS 4 2025-05-07T20:27:13.8696458Z #define _POSIX_HOST_NAME_MAX 255 2025-05-07T20:27:13.8696731Z #define __FLT64_MAX_10_EXP__ 308 2025-05-07T20:27:13.8696992Z #define NGROUPS_MAX 65536 2025-05-07T20:27:13.8697232Z #define _GLIBCXX_NAMESPACE_LDBL 2025-05-07T20:27:13.8697483Z #define __USE_ISOC95 1 2025-05-07T20:27:13.8697700Z #define _TIME_H 1 2025-05-07T20:27:13.8697954Z #define M_LOG10El 0.434294481903251827651128918916605082L 2025-05-07T20:27:13.8698262Z #define __USE_ISOC99 1 2025-05-07T20:27:13.8698581Z #define __ASMNAME(cname) __ASMNAME2 (__USER_LABEL_PREFIX__, cname) 2025-05-07T20:27:13.8698933Z #define HOST_NAME_MAX 64 2025-05-07T20:27:13.8699171Z #define _POSIX_SEM_NSEMS_MAX 256 2025-05-07T20:27:13.8699423Z #define _IOS_ATEND 4 2025-05-07T20:27:13.8699640Z #define __SM_35_INTRINSICS_H__ 2025-05-07T20:27:13.8699957Z #define WTERMSIG(status) __WTERMSIG (__WAIT_INT (status)) 2025-05-07T20:27:13.8700352Z #define cudaStreamAttrValue cudaLaunchAttributeValue 2025-05-07T20:27:13.8700685Z #define _GLIBCXX_HAVE_S_ISREG 1 2025-05-07T20:27:13.8700948Z #define cudaSurfaceTypeCubemap 0x0C 2025-05-07T20:27:13.8701260Z #define __cpp_delegating_constructors 200604L 2025-05-07T20:27:13.8701567Z #define __FLT32_HAS_INFINITY__ 1 2025-05-07T20:27:13.8701811Z #define _STDIO_H 1 2025-05-07T20:27:13.8702192Z #define __isctype_l(c,type,locale) ((locale)->__ctype_b[(int) (c)] & (unsigned short int) type) 2025-05-07T20:27:13.8702647Z #define _GLIBCXX_PREDEFINED_OPS_H 1 2025-05-07T20:27:13.8702988Z #define __DBL_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:27:13.8703354Z #define _G_IO_IO_FILE_VERSION 0x20001 2025-05-07T20:27:13.8703638Z #define _POSIX_SIGQUEUE_MAX 32 2025-05-07T20:27:13.8703896Z #define _GLIBCXX_HAVE_GETS 1 2025-05-07T20:27:13.8704163Z #define _GLIBCXX_HAVE_LINUX_TYPES_H 1 2025-05-07T20:27:13.8704447Z #define __cpp_raw_strings 200710L 2025-05-07T20:27:13.8704740Z #define __INT_FAST32_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:13.8705040Z #define _GLIBCXX_HAVE_VFWSCANF 1 2025-05-07T20:27:13.8705305Z #define __DBL_HAS_INFINITY__ 1 2025-05-07T20:27:13.8705574Z #define __STDCPP_MATH_SPEC_FUNCS__ 201003L 2025-05-07T20:27:13.8705965Z #define _GLIBCXX_STDIO_EOF -1 2025-05-07T20:27:13.8706255Z #define 
__SIZEOF_PTHREAD_MUTEX_T 40 2025-05-07T20:27:13.8706530Z #define __CHANNEL_DESCRIPTOR_H__ 2025-05-07T20:27:13.8706873Z #define _ISbit(bit) ((bit) < 8 ? ((1 << (bit)) << 8) : ((1 << (bit)) >> 8)) 2025-05-07T20:27:13.8707234Z #define __SIZEOF_FLOAT__ 4 2025-05-07T20:27:13.8707472Z #define __USE_XOPEN 1 2025-05-07T20:27:13.8707788Z #define __SIZEOF_PTHREAD_RWLOCK_T 56 2025-05-07T20:27:13.8708217Z #define cudaStreamAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:27:13.8708641Z #define __USE_XOPEN2K 1 2025-05-07T20:27:13.8708881Z #define _PSTL_UDR_PRESENT 1 2025-05-07T20:27:13.8709132Z #define __HAVE_SPECULATION_SAFE_VALUE 1 2025-05-07T20:27:13.8709458Z #define _GLIBCXX_HAVE_COSF 1 2025-05-07T20:27:13.8709837Z #define __cpp_fold_expressions 201603L 2025-05-07T20:27:13.8710576Z #define cudaWaitExternalSemaphoresAsync __CUDART_API_PTSZ(cudaWaitExternalSemaphoresAsync_v2) 2025-05-07T20:27:13.8711207Z #define NL_LANGMAX _POSIX2_LINE_MAX 2025-05-07T20:27:13.8711482Z #define __DEC32_MIN_EXP__ (-94) 2025-05-07T20:27:13.8711824Z #define __glibcxx_requires_partitioned_upper(_First,_Last,_Value) 2025-05-07T20:27:13.8712197Z #define __DADDR_T_TYPE __S32_TYPE 2025-05-07T20:27:13.8712572Z #define cudaExternalSemaphoreSignalSkipNvSciBufMemSync 0x01 2025-05-07T20:27:13.8712958Z #define __END_NAMESPACE_C99 2025-05-07T20:27:13.8713227Z #define __glibcxx_integral_traps true 2025-05-07T20:27:13.8713507Z #define _POSIX_PATH_MAX 256 2025-05-07T20:27:13.8713761Z #define __INTPTR_WIDTH__ 64 2025-05-07T20:27:13.8714095Z #define __FLT64X_HAS_INFINITY__ 1 2025-05-07T20:27:13.8714355Z #define _IOS_TRUNC 16 2025-05-07T20:27:13.8714581Z #define _ISOC11_SOURCE 1 2025-05-07T20:27:13.8714817Z #define _GLIBCXX_HAVE_LINUX_FUTEX 1 2025-05-07T20:27:13.8715095Z #define __UINT_LEAST32_MAX__ 0xffffffffU 2025-05-07T20:27:13.8715386Z #define _GLIBCXX_HAVE_QUICK_EXIT 1 2025-05-07T20:27:13.8715733Z #define __glibcxx_requires_irreflexive_pred2(_First,_Last,_Pred) 2025-05-07T20:27:13.8716121Z #define LONG_MIN (-LONG_MAX - 1L) 2025-05-07T20:27:13.8716393Z #define _GLIBCXX_HAVE_SINCOSF 1 2025-05-07T20:27:13.8716641Z #define _IO_UNITBUF 020000 2025-05-07T20:27:13.8716896Z #define _GLIBCXX_HAVE_SINCOSL 1 2025-05-07T20:27:13.8717150Z #define __FD_SETSIZE 1024 2025-05-07T20:27:13.8717389Z #define getc(_fp) _IO_getc (_fp) 2025-05-07T20:27:13.8717655Z #define be32toh(x) __bswap_32 (x) 2025-05-07T20:27:13.8717988Z #define _GLIBCXX_PACKAGE__GLIBCXX_VERSION "version-unused" 2025-05-07T20:27:13.8718335Z #define __FLT32X_HAS_DENORM__ 1 2025-05-07T20:27:13.8718592Z #define __INT_FAST16_TYPE__ long int 2025-05-07T20:27:13.8718893Z #define isxdigit_l(c,l) __isxdigit_l ((c), (l)) 2025-05-07T20:27:13.8719206Z #define _GLIBCXX_HAVE_GETIPINFO 1 2025-05-07T20:27:13.8719462Z #define __MMX_WITH_SSE__ 1 2025-05-07T20:27:13.8719755Z #define __isalnum_l(c,l) __isctype_l((c), _ISalnum, (l)) 2025-05-07T20:27:13.8720080Z #define _WCHAR_T_DEFINED_ 2025-05-07T20:27:13.8720358Z #define cudaIpcMemLazyEnablePeerAccess 0x01 2025-05-07T20:27:13.8720674Z #define _GLIBCXX_HAVE_AT_QUICK_EXIT 1 2025-05-07T20:27:13.8720964Z #define __INO_T_MATCHES_INO64_T 1 2025-05-07T20:27:13.8721229Z #define __USE_POSIX199506 1 2025-05-07T20:27:13.8721475Z #define _FEATURES_H 1 2025-05-07T20:27:13.8721705Z #define __LDBL_HAS_DENORM__ 1 2025-05-07T20:27:13.8722095Z #define _PSTL_PRAGMA_SIMD_REDUCTION(PRM) _PSTL_PRAGMA(omp simd reduction(PRM)) 2025-05-07T20:27:13.8722549Z #define __WEXITSTATUS(status) (((status) & 0xff00) >> 8) 2025-05-07T20:27:13.8722870Z #define 
__stub_getmsg 2025-05-07T20:27:13.8723103Z #define _IO_FIXED 010000 2025-05-07T20:27:13.8723362Z #define __cpp_lib_addressof_constexpr 201603 2025-05-07T20:27:13.8723668Z #define _GLIBCXX11_USE_C99_STDIO 1 2025-05-07T20:27:13.8723934Z #define __stub_setlogin 2025-05-07T20:27:13.8724163Z #define __stub_fattach 2025-05-07T20:27:13.8724400Z #define __cplusplus 201703L 2025-05-07T20:27:13.8724664Z #define __cpp_ref_qualifiers 200710L 2025-05-07T20:27:13.8724933Z #define _STRUCT_TIMEVAL 1 2025-05-07T20:27:13.8725187Z #define INFINITY (__builtin_inff()) 2025-05-07T20:27:13.8725465Z #define _IO_UNBUFFERED 2 2025-05-07T20:27:13.8725939Z #define cudaStreamAttributeSynchronizationPolicy cudaLaunchAttributeSynchronizationPolicy 2025-05-07T20:27:13.8726453Z #define _IO_INTERNAL 010 2025-05-07T20:27:13.8726698Z #define __DEC32_MIN__ 1E-95DF 2025-05-07T20:27:13.8727029Z #define cudaKernelNodeAttrValue cudaLaunchAttributeValue 2025-05-07T20:27:13.8727369Z #define __dev_t_defined 2025-05-07T20:27:13.8727601Z #define __DEPRECATED 1 2025-05-07T20:27:13.8727828Z #define __S32_TYPE int 2025-05-07T20:27:13.8728067Z #define __cpp_rvalue_references 200610L 2025-05-07T20:27:13.8728365Z #define __DBL_MAX_EXP__ 1024 2025-05-07T20:27:13.8728619Z #define _IO_fpos_t _G_fpos_t 2025-05-07T20:27:13.8728862Z #define __WCHAR_WIDTH__ 32 2025-05-07T20:27:13.8729470Z #define cudaKernelNodeAttributePreferredSharedMemoryCarveout cudaLaunchAttributePreferredSharedMemoryCarveout 2025-05-07T20:27:13.8730197Z #define _G_HAVE_MREMAP 1 2025-05-07T20:27:13.8730495Z #define __FLT32_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:27:13.8730824Z #define OVERFLOW 3 2025-05-07T20:27:13.8731063Z #define __toascii_l(c,l) ((l), __toascii (c)) 2025-05-07T20:27:13.8731366Z #define __DEC128_EPSILON__ 1E-33DL 2025-05-07T20:27:13.8731635Z #define __SM_32_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:27:13.8731963Z #define _GLIBCXX_DEFAULT_ABI_TAG _GLIBCXX_ABI_TAG_CXX11 2025-05-07T20:27:13.8732285Z #define __SSE2_MATH__ 1 2025-05-07T20:27:13.8732516Z #define __ATOMIC_HLE_RELEASE 131072 2025-05-07T20:27:13.8732902Z #define __FSFILCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:27:13.8733201Z #define _IO_STDIO_H 2025-05-07T20:27:13.8733435Z #define PDP_ENDIAN __PDP_ENDIAN 2025-05-07T20:27:13.8733721Z #define isspace_l(c,l) __isspace_l ((c), (l)) 2025-05-07T20:27:13.8734032Z #define __cudaCDP2Memcpy2DAsync 2025-05-07T20:27:13.8734319Z #define __PTRDIFF_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:13.8734625Z #define _GLIBCXX_HAVE_STRERROR_R 1 2025-05-07T20:27:13.8734893Z #define __amd64 1 2025-05-07T20:27:13.8735109Z #define _POSIX_TZNAME_MAX 6 2025-05-07T20:27:13.8735363Z #define __cudaCDP2Memset3DAsync 2025-05-07T20:27:13.8735634Z #define __SYSCALL_WORDSIZE 64 2025-05-07T20:27:13.8735916Z #define _GLIBCXX_HAVE_ATTRIBUTE_VISIBILITY 1 2025-05-07T20:27:13.8736211Z #define _EXT_TYPE_TRAITS 1 2025-05-07T20:27:13.8736469Z #define _GLIBCXX_HAVE_POSIX_SEMAPHORE 1 2025-05-07T20:27:13.8736759Z #define _POSIX_RE_DUP_MAX 255 2025-05-07T20:27:13.8737006Z #define __STDC_NO_THREADS__ 1 2025-05-07T20:27:13.8737258Z #define __bounded 2025-05-07T20:27:13.8737482Z #define _GLIBCXX_HAVE_ACOSL 1 2025-05-07T20:27:13.8737737Z #define __USECONDS_T_TYPE __U32_TYPE 2025-05-07T20:27:13.8738017Z #define _IO_DELETE_DONT_CLOSE 0x40 2025-05-07T20:27:13.8738366Z #define __BEGIN_NAMESPACE_STD 2025-05-07T20:27:13.8738703Z #define _PTRDIFF_T_DECLARED 2025-05-07T20:27:13.8738975Z #define __OFF_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:27:13.8739287Z #define __W_STOPCODE(sig) ((sig) 
<< 8 | 0x7f) 2025-05-07T20:27:13.8739696Z #define cudaStreamAttributePriority cudaLaunchAttributePriority 2025-05-07T20:27:13.8740339Z #define _GLIBCXX_HAVE_NETDB_H 1 2025-05-07T20:27:13.8740624Z #define __SM_20_INTRINSICS_HPP__ 2025-05-07T20:27:13.8740963Z #define __cpp_lib_has_unique_object_representations 201606 2025-05-07T20:27:13.8741297Z #define STA_PLL 0x0001 2025-05-07T20:27:13.8741534Z #define __ATOMIC_HLE_ACQUIRE 65536 2025-05-07T20:27:13.8741792Z #define __GNUG__ 11 2025-05-07T20:27:13.8742011Z #define _GLIBCXX_USE_GET_NPROCS 1 2025-05-07T20:27:13.8742278Z #define _T_WCHAR 2025-05-07T20:27:13.8742514Z #define __cudaCDP2GetDeviceCount 2025-05-07T20:27:13.8742790Z #define __specialization_static 2025-05-07T20:27:13.8743086Z #define __LONG_LONG_MAX__ 0x7fffffffffffffffLL 2025-05-07T20:27:13.8743388Z #define __SIZEOF_SIZE_T__ 8 2025-05-07T20:27:13.8743637Z #define cudaArraySparse 0x40 2025-05-07T20:27:13.8743893Z #define STA_PPSFREQ 0x0002 2025-05-07T20:27:13.8744172Z #define _IO_stdin ((_IO_FILE*)(&_IO_2_1_stdin_)) 2025-05-07T20:27:13.8744471Z #define _WCHAR_T 2025-05-07T20:27:13.8744678Z #define __cudaCDP2Free 2025-05-07T20:27:13.8745313Z #define __FD_ZERO(fdsp) do { int __d0, __d1; __asm__ __volatile__ ("cld; rep; " __FD_ZERO_STOS : "=c" (__d0), "=D" (__d1) : "a" (0), "0" (sizeof (fd_set) / sizeof (__fd_mask)), "1" (&__FDS_BITS (fdsp)[0]) : "memory"); } while (0) 2025-05-07T20:27:13.8745995Z #define __cpp_nsdmi 200809L 2025-05-07T20:27:13.8746423Z #define __glibcxx_min_b(T,B) (__glibcxx_signed_b (T,B) ? -__glibcxx_max_b (T,B) - 1 : (T)0) 2025-05-07T20:27:13.8746869Z #define __FLT64X_MIN_EXP__ (-16381) 2025-05-07T20:27:13.8747138Z #define __SIZEOF_WINT_T__ 4 2025-05-07T20:27:13.8747388Z #define cudaArrayCubemap 0x04 2025-05-07T20:27:13.8747806Z #define _PSTL_MONOTONIC_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:27:13.8748146Z #define _GLIBCXX_UTILITY 1 2025-05-07T20:27:13.8748386Z #define __NO_CTYPE 1 2025-05-07T20:27:13.8748853Z #define __stub_bdflush 2025-05-07T20:27:13.8749210Z #define _GLIBCXX_MAKE_MOVE_ITERATOR(_Iter) std::make_move_iterator(_Iter) 2025-05-07T20:27:13.8749619Z #define __CORRECT_ISO_CPP_STRING_H_PROTO 2025-05-07T20:27:13.8749903Z #define _GLIBCXX_STDC_HEADERS 1 2025-05-07T20:27:13.8750161Z #define __LONG_LONG_WIDTH__ 64 2025-05-07T20:27:13.8750429Z #define __cpp_initializer_lists 200806L 2025-05-07T20:27:13.8750714Z #define _GLIBCXX_HAVE_NETINET_TCP_H 1 2025-05-07T20:27:13.8751001Z #define __U16_TYPE unsigned short int 2025-05-07T20:27:13.8751329Z #define __glibcxx_requires_can_increment(_First,_Size) 2025-05-07T20:27:13.8751791Z #define _GLIBCXX_HAVE_SYS_PARAM_H 1 2025-05-07T20:27:13.8752066Z #define __FLT32_MAX_EXP__ 128 2025-05-07T20:27:13.8752338Z #define cudaHostRegisterIoMemory 0x04 2025-05-07T20:27:13.8753091Z #define __FD_MASK(d) ((__fd_mask) 1 << ((d) % __NFDBITS)) 2025-05-07T20:27:13.8762529Z #define __cpp_lib_is_invocable 201703 2025-05-07T20:27:13.8762819Z #define _IO_STDIO 040000 2025-05-07T20:27:13.8763156Z #define _SIGSET_NWORDS (1024 / (8 * sizeof (unsigned long int))) 2025-05-07T20:27:13.8763544Z #define cudaSurfaceType1DLayered 0xF1 2025-05-07T20:27:13.8763849Z #define cudaArraySurfaceLoadStore 0x02 2025-05-07T20:27:13.8764133Z #define _PTRDIFF_T 2025-05-07T20:27:13.8764345Z #define _MOVE_H 1 2025-05-07T20:27:13.8764565Z #define __cpp_hex_float 201603L 2025-05-07T20:27:13.8764821Z #define ADJ_TAI 0x0080 2025-05-07T20:27:13.8765044Z #define __ptrvalue 2025-05-07T20:27:13.8765261Z #define _GLIBCXX_HOSTED 1 2025-05-07T20:27:13.8765513Z 
#define __GXX_ABI_VERSION 1016 2025-05-07T20:27:13.8765817Z #define __WTERMSIG(status) ((status) & 0x7f) 2025-05-07T20:27:13.8766216Z #define MATH_ERREXCEPT 2 2025-05-07T20:27:13.8766502Z #define _GLIBCXX_HAS_GTHREADS 1 2025-05-07T20:27:13.8766786Z #define cudaTextureType2DLayered 0xF2 2025-05-07T20:27:13.8767169Z #define __isleap(year) ((year) % 4 == 0 && ((year) % 100 != 0 || (year) % 400 == 0)) 2025-05-07T20:27:13.8767540Z #define __USE_GNU 1 2025-05-07T20:27:13.8767769Z #define __FLT128_HAS_INFINITY__ 1 2025-05-07T20:27:13.8768043Z #define __FLT_MIN_EXP__ (-125) 2025-05-07T20:27:13.8768349Z #define __GCC_HAVE_DWARF2_CFI_ASM 1 2025-05-07T20:27:13.8768733Z #define __FD_CLR(d,set) ((void) (__FDS_BITS (set)[__FD_ELT (d)] &= ~__FD_MASK (d))) 2025-05-07T20:27:13.8769112Z #define WEXITED 4 2025-05-07T20:27:13.8769317Z #define _IO_NO_READS 4 2025-05-07T20:27:13.8769660Z #define cudaGraphKernelNodePortLaunchCompletion 2 2025-05-07T20:27:13.8770041Z #define M_LOG2E 1.4426950408889634074 2025-05-07T20:27:13.8770306Z #define _POSIX_SYMLINK_MAX 255 2025-05-07T20:27:13.8770601Z #define _GLIBCXX_HAVE_BUILTIN_HAS_UNIQ_OBJ_REP 1 2025-05-07T20:27:13.8770908Z #define __uid_t_defined 2025-05-07T20:27:13.8771143Z #define __FD_ELT(d) ((d) / __NFDBITS) 2025-05-07T20:27:13.8771431Z #define _GLIBCXX_USE_STD_SPEC_FUNCS 1 2025-05-07T20:27:13.8771694Z #define WNOHANG 1 2025-05-07T20:27:13.8771932Z #define alloca(size) __builtin_alloca (size) 2025-05-07T20:27:13.8772223Z #define _GLIBCXX_HAVE_HYPOTF 1 2025-05-07T20:27:13.8772497Z #define cudaEventDefault 0x00 2025-05-07T20:27:13.8772796Z #define __maxnreg__(a) __attribute__((maxnreg(a))) 2025-05-07T20:27:13.8773101Z #define NL_SETMAX INT_MAX 2025-05-07T20:27:13.8773334Z #define __x86_64 1 2025-05-07T20:27:13.8773562Z #define __cudaCDP2LaunchDevice 2025-05-07T20:27:13.8773941Z #define __REDIRECT(name,proto,alias) name proto __asm__ (__ASMNAME (#alias)) 2025-05-07T20:27:13.8774421Z #define _GLIBCXX_BEGIN_NAMESPACE_CXX11 namespace __cxx11 { 2025-05-07T20:27:13.8774914Z #define __extern_always_inline extern __always_inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:27:13.8775331Z #define __PTRDIFF_T 2025-05-07T20:27:13.8775650Z #define __exctype_l(name) extern int name (int, __locale_t) __THROW 2025-05-07T20:27:13.8776015Z #define _GLIBCXX_HAVE_FINITEL 1 2025-05-07T20:27:13.8776285Z #define __SM_35_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:27:13.8776566Z #define _Mlong_double_ long double 2025-05-07T20:27:13.8776837Z #define __cpp_lambdas 200907L 2025-05-07T20:27:13.8777302Z #define _IO_DEC 020 2025-05-07T20:27:13.8777515Z #define _GLIBCXX_HAVE_SINHL 1 2025-05-07T20:27:13.8777774Z #define _POSIX_CLOCKRES_MIN 20000000 2025-05-07T20:27:13.8778058Z #define __INT_FAST64_TYPE__ long int 2025-05-07T20:27:13.8778321Z #define ADJ_TIMECONST 0x0020 2025-05-07T20:27:13.8778579Z #define _GLIBCXX_HAVE_SQRTL 1 2025-05-07T20:27:13.8778870Z #define __cudaCDP2DeviceGetSharedMemConfig 2025-05-07T20:27:13.8779184Z #define _GLIBCXX_HAVE_STDALIGN_H 1 2025-05-07T20:27:13.8779449Z #define _ANSI_STDDEF_H 2025-05-07T20:27:13.8779722Z #define _GLIBCXX_MOVE(__val) std::move(__val) 2025-05-07T20:27:13.8780126Z #define _GLIBCXX_HAVE_STRERROR_L 1 2025-05-07T20:27:13.8780486Z #define __FLT64_DENORM_MIN__ 4.94065645841246544176568792868221372e-324F64 2025-05-07T20:27:13.8780857Z #define _GLIBCXX_USE_DEV_RANDOM 1 2025-05-07T20:27:13.8781132Z #define _STL_ITERATOR_BASE_TYPES_H 1 2025-05-07T20:27:13.8781414Z #define __cpp_template_auto 201606L 2025-05-07T20:27:13.8781766Z #define __DBL_MIN__ 
double(2.22507385850720138309023271733240406e-308L) 2025-05-07T20:27:13.8782133Z #define _GLIBCXX_HAVE_SYS_SEM_H 1 2025-05-07T20:27:13.8782390Z #define __key_t_defined 2025-05-07T20:27:13.8782633Z #define _IO_MAGIC_MASK 0xFFFF0000 2025-05-07T20:27:13.8782996Z #define __cluster_dims__(...) __attribute__((cluster_dims(__VA_ARGS__))) 2025-05-07T20:27:13.8783450Z #define __FLT128_EPSILON__ 1.92592994438723585305597794258492732e-34F128 2025-05-07T20:27:13.8783813Z #define __GNUC_VA_LIST 2025-05-07T20:27:13.8784144Z #define __FLT64X_NORM_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:27:13.8784521Z #define __SIZEOF_POINTER__ 8 2025-05-07T20:27:13.8784781Z #define CLOCK_REALTIME_COARSE 5 2025-05-07T20:27:13.8785053Z #define _GLIBCXX14_CONSTEXPR constexpr 2025-05-07T20:27:13.8785329Z #define __USE_XOPEN2KXSI 1 2025-05-07T20:27:13.8785571Z #define __WCOREFLAG 0x80 2025-05-07T20:27:13.8785817Z #define M_2_SQRTPI 1.12837916709551257390 2025-05-07T20:27:13.8786105Z #define cudaEventDisableTiming 0x02 2025-05-07T20:27:13.8786377Z #define __LP64__ 1 2025-05-07T20:27:13.8786616Z #define __isascii_l(c,l) ((l), __isascii (c)) 2025-05-07T20:27:13.8786914Z #define cudaStreamNonBlocking 0x01 2025-05-07T20:27:13.8787187Z #define _IO_off64_t __off64_t 2025-05-07T20:27:13.8787439Z #define __DBL_HAS_QUIET_NAN__ 1 2025-05-07T20:27:13.8787812Z #define __time_t_defined 1 2025-05-07T20:27:13.8788052Z #define _POSIX_SYMLOOP_MAX 8 2025-05-07T20:27:13.8788389Z #define __FLT32X_EPSILON__ 2.22044604925031308084726333618164062e-16F32x 2025-05-07T20:27:13.8788750Z #define __USE_UNIX98 1 2025-05-07T20:27:13.8788972Z #define __MODE_T_TYPE __U32_TYPE 2025-05-07T20:27:13.8789238Z #define CLOCK_REALTIME_ALARM 8 2025-05-07T20:27:13.8789498Z #define _GLIBCXX_HAVE_STRINGS_H 1 2025-05-07T20:27:13.8789778Z #define __LEAF_ATTR __attribute__ ((__leaf__)) 2025-05-07T20:27:13.8790078Z #define __DECIMAL_BID_FORMAT__ 1 2025-05-07T20:27:13.8790327Z #define SEEK_CUR 1 2025-05-07T20:27:13.8790542Z #define __RLIM64_T_TYPE __UQUAD_TYPE 2025-05-07T20:27:13.8790804Z #define _ASSERT_H 1 2025-05-07T20:27:13.8791353Z #define _PSTL_PRAGMA_DECLARE_REDUCTION(NAME,OP) _PSTL_PRAGMA(omp declare reduction(NAME:OP : omp_out(omp_in)) initializer(omp_priv = omp_orig)) 2025-05-07T20:27:13.8791956Z #define _GLIBCXX_USE_DEPRECATED 1 2025-05-07T20:27:13.8792216Z #define CHAR_MAX SCHAR_MAX 2025-05-07T20:27:13.8792463Z #define _GLIBCXX_HAVE_SETENV 1 2025-05-07T20:27:13.8792720Z #define NL_ARGMAX _POSIX_ARG_MAX 2025-05-07T20:27:13.8792975Z #define _GLIBCXX_USE_UTIMENSAT 1 2025-05-07T20:27:13.8793332Z #define __extern_inline extern __inline __attribute__ ((__gnu_inline__)) 2025-05-07T20:27:13.8793729Z #define _GLIBCXX_DEBUG_ONLY(_Statement) 2025-05-07T20:27:13.8794360Z #define _IO_putc_unlocked(_ch,_fp) (_IO_BE ((_fp)->_IO_write_ptr >= (_fp)->_IO_write_end, 0) ? 
__overflow (_fp, (unsigned char) (_ch)) : (unsigned char) (*(_fp)->_IO_write_ptr++ = (_ch))) 2025-05-07T20:27:13.8794989Z #define _GLIBCXX_HAVE_BUILTIN_LAUNDER 1 2025-05-07T20:27:13.8795276Z #define _IO_BOOLALPHA 0200000 2025-05-07T20:27:13.8795718Z #define _PSTL_CPP17_EXECUTION_POLICIES_PRESENT (_MSC_VER >= 1912) 2025-05-07T20:27:13.8796077Z #define _GLIBCXX_PACKAGE_URL "" 2025-05-07T20:27:13.8796335Z #define __FLT64_MIN_10_EXP__ (-307) 2025-05-07T20:27:13.8796609Z #define cudaArrayDefault 0x00 2025-05-07T20:27:13.8796872Z #define __cudaCDP2LaunchDeviceV2 2025-05-07T20:27:13.8797152Z #define __FDS_BITS(set) ((set)->fds_bits) 2025-05-07T20:27:13.8797426Z #define TLOSS 5 2025-05-07T20:27:13.8797630Z #define __ssize_t_defined 2025-05-07T20:27:13.8797873Z #define __CUDACC_VER_BUILD__ 61 2025-05-07T20:27:13.8798226Z #define ULONG_MAX (LONG_MAX * 2UL + 1UL) 2025-05-07T20:27:13.8798506Z #define __FLT64X_DECIMAL_DIG__ 21 2025-05-07T20:27:13.8798774Z #define _POSIX_HIWAT _POSIX_PIPE_BUF 2025-05-07T20:27:13.8799051Z #define __DEC128_MIN__ 1E-6143DL 2025-05-07T20:27:13.8799319Z #define __cudaCDP2EventRecordWithFlags 2025-05-07T20:27:13.8799623Z #define _GLIBCXX_ATOMIC_BUILTINS 1 2025-05-07T20:27:13.8799913Z #define cudaPeerAccessDefault 0x00 2025-05-07T20:27:13.8800252Z #define _GLIBCXX_HAVE_SYS_SOCKET_H 1 2025-05-07T20:27:13.8800618Z #define __REGISTER_PREFIX__ 2025-05-07T20:27:13.8800980Z #define __UINT16_MAX__ 0xffff 2025-05-07T20:27:13.8801419Z #define __glibcxx_requires_sorted_set(_First1,_Last1,_First2) 2025-05-07T20:27:13.8801771Z #define _IOS_NOREPLACE 64 2025-05-07T20:27:13.8802005Z #define __cdecl 2025-05-07T20:27:13.8802233Z #define cudaEventInterprocess 0x04 2025-05-07T20:27:13.8802548Z #define M_SQRT1_2l 0.707106781186547524400844362104849039L 2025-05-07T20:27:13.8802870Z #define LOGIN_NAME_MAX 256 2025-05-07T20:27:13.8803123Z #define _IO_TIED_PUT_GET 0x400 2025-05-07T20:27:13.8803380Z #define X_TLOSS 1.41484755040568800000e+16 2025-05-07T20:27:13.8803659Z #define CUDA_IPC_HANDLE_SIZE 64 2025-05-07T20:27:13.8803917Z #define __LDBL_HAS_INFINITY__ 1 2025-05-07T20:27:13.8804221Z #define __attribute_pure__ __attribute__ ((__pure__)) 2025-05-07T20:27:13.8804539Z #define __TEXTURE_TYPES_H__ 2025-05-07T20:27:13.8804939Z #define __NV_GLIBCXX_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:27:13.8805365Z #define ADJ_NANO 0x2000 2025-05-07T20:27:13.8805655Z #define __FLT32_MIN__ 1.17549435082228750796873653722224568e-38F32 2025-05-07T20:27:13.8806002Z #define __UINT8_TYPE__ unsigned char 2025-05-07T20:27:13.8806279Z #define _GLIBCXX_HAVE_ISWBLANK 1 2025-05-07T20:27:13.8806528Z #define __FLT_DIG__ 6 2025-05-07T20:27:13.8806864Z #define __REDIRECT_LDBL(name,proto,alias) __REDIRECT (name, proto, alias) 2025-05-07T20:27:13.8807257Z #define __NO_INLINE__ 1 2025-05-07T20:27:13.8807550Z #define _PSTL_EARLYEXIT_PRESENT (__INTEL_COMPILER >= 1800) 2025-05-07T20:27:13.8807890Z #define _POSIX_NGROUPS_MAX 8 2025-05-07T20:27:13.8808148Z #define ADJ_STATUS 0x0010 2025-05-07T20:27:13.8808442Z #define __cudaCDP2MemcpyAsync_ptsz 2025-05-07T20:27:13.8808715Z #define CLOCK_BOOTTIME_ALARM 9 2025-05-07T20:27:13.8808978Z #define LONG_LONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:27:13.8809272Z #define _GLIBCXX_HAVE_OBSOLETE_ISNAN 1 2025-05-07T20:27:13.8809558Z #define __DEC_EVAL_METHOD__ 2 2025-05-07T20:27:13.8809939Z #define cudaStreamGraphFireAndForget (cudaStream_t)0x0200000000000000 2025-05-07T20:27:13.8810345Z #define _GLIBCXX_HAVE_ALIGNED_ALLOC 1 2025-05-07T20:27:13.8810677Z 
#define __DEC128_MAX__ 9.999999999999999999999999999999999E6144DL 2025-05-07T20:27:13.8811010Z #define CHAR_MIN SCHAR_MIN 2025-05-07T20:27:13.8811245Z #define MAX_CANON 255 2025-05-07T20:27:13.8811464Z #define __FLT_MANT_DIG__ 24 2025-05-07T20:27:13.8811712Z #define __LDBL_DECIMAL_DIG__ 21 2025-05-07T20:27:13.8811973Z #define _GLIBCXX_HAVE_COMPLEX_H 1 2025-05-07T20:27:13.8812258Z #define _PSTL_PRAGMA_VECTOR_UNALIGNED 2025-05-07T20:27:13.8812557Z #define _POSIX_FD_SETSIZE _POSIX_OPEN_MAX 2025-05-07T20:27:13.8812851Z #define _GLIBCXX_HAVE_HYPOT 1 2025-05-07T20:27:13.8813118Z #define __cudaCDP2Memset2DAsync_ptsz 2025-05-07T20:27:13.8813425Z #define _GLIBCXX_TR1_MODIFIED_BESSEL_FUNC_TCC 1 2025-05-07T20:27:13.8813734Z #define __VERSION__ "11.4.0" 2025-05-07T20:27:13.8814091Z #define _GLIBCXX11_USE_C99_STDLIB 1 2025-05-07T20:27:13.8814374Z #define cudaHostRegisterMapped 0x02 2025-05-07T20:27:13.8814657Z #define _GLIBCXX_HAVE_INT64_T 1 2025-05-07T20:27:13.8814929Z #define _GLIBCXX_USE_CONSTEXPR constexpr 2025-05-07T20:27:13.8815224Z #define FD_ZERO(fdsetp) __FD_ZERO (fdsetp) 2025-05-07T20:27:13.8815513Z #define __UINT64_C(c) c ## UL 2025-05-07T20:27:13.8815765Z #define MOD_OFFSET ADJ_OFFSET 2025-05-07T20:27:13.8816002Z #define _SYS_TYPES_H 1 2025-05-07T20:27:13.8816235Z #define AIO_PRIO_DELTA_MAX 20 2025-05-07T20:27:13.8816492Z #define _GLIBCXX_HAVE_TANHF 1 2025-05-07T20:27:13.8816815Z #define _SYS_CDEFS_H 1 2025-05-07T20:27:13.8817041Z #define _GLIBCXX_HAVE_TANHL 1 2025-05-07T20:27:13.8817308Z #define __cpp_unicode_characters 201411L 2025-05-07T20:27:13.8817589Z #define _IO_ERR_SEEN 0x20 2025-05-07T20:27:13.8817828Z #define _GLIBCXX_USE_DECIMAL_FLOAT 1 2025-05-07T20:27:13.8818112Z #define __cudaCDP2StreamDestroy 2025-05-07T20:27:13.8818382Z #define FP_SUBNORMAL 3 2025-05-07T20:27:13.8818617Z #define cudaOccupancyDefault 0x00 2025-05-07T20:27:13.8818888Z #define _INITIALIZER_LIST 2025-05-07T20:27:13.8819132Z #define _STDC_PREDEF_H 1 2025-05-07T20:27:13.8819393Z #define _GLIBCXX_PACKAGE_BUGREPORT "" 2025-05-07T20:27:13.8819678Z #define _GLIBCXX_HAVE_MODF 1 2025-05-07T20:27:13.8819922Z #define _IO_file_flags _flags 2025-05-07T20:27:13.8820172Z #define __USE_XOPEN2K8 1 2025-05-07T20:27:13.8820419Z #define htobe64(x) __bswap_64 (x) 2025-05-07T20:27:13.8820691Z #define _OLD_STDIO_MAGIC 0xFABC0000 2025-05-07T20:27:13.8820951Z #define HUGE 3.40282347e+38F 2025-05-07T20:27:13.8821218Z #define __cpp_lib_is_null_pointer 201309 2025-05-07T20:27:13.8821586Z #define WEXITSTATUS(status) __WEXITSTATUS (__WAIT_INT (status)) 2025-05-07T20:27:13.8821961Z #define islower_l(c,l) __islower_l ((c), (l)) 2025-05-07T20:27:13.8822259Z #define _GLIBCXX_USE_CXX11_ABI 1 2025-05-07T20:27:13.8822526Z #define _GLIBCXX_HAVE_SYMLINK 1 2025-05-07T20:27:13.8822769Z #define _BSD_SOURCE 1 2025-05-07T20:27:13.8823002Z #define _GLIBCXX_THROW(_EXC) 2025-05-07T20:27:13.8823823Z #define _GLIBCXX_HAS_NESTED_TYPE(_NTYPE) template<typename _Tp, typename = __void_t<>> struct __has_ ##_NTYPE : false_type { }; template<typename _Tp> struct __has_ ##_NTYPE<_Tp, __void_t<typename _Tp::_NTYPE>> : true_type { }; 2025-05-07T20:27:13.8824657Z #define __catch(X) catch(X) 2025-05-07T20:27:13.8824906Z #define __INT_LEAST32_MAX__ 0x7fffffff 2025-05-07T20:27:13.8825185Z #define LINE_MAX _POSIX2_LINE_MAX 2025-05-07T20:27:13.8825452Z #define __TIMER_T_TYPE void * 2025-05-07T20:27:13.8825688Z #define __STRING(x) #x 2025-05-07T20:27:13.8825972Z #define __GCC_ATOMIC_INT_LOCK_FREE 2 2025-05-07T20:27:13.8826292Z #define _T_PTRDIFF_ 2025-05-07T20:27:13.8826525Z #define _GLIBCXX_USE_NOEXCEPT noexcept 2025-05-07T20:27:13.8826826Z
#define cudaEventWaitExternal 0x01 2025-05-07T20:27:13.8827088Z #define __unbounded 2025-05-07T20:27:13.8827313Z #define __DEVICE_ATOMIC_FUNCTIONS_H__ 2025-05-07T20:27:13.8827709Z #define __FLT128_MAX_EXP__ 16384 2025-05-07T20:27:13.8827983Z #define __INO_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:27:13.8828269Z #define be16toh(x) __bswap_16 (x) 2025-05-07T20:27:13.8828546Z #define __cpp_lib_is_final 201402L 2025-05-07T20:27:13.8828834Z #define _GLIBCXX_BEGIN_NAMESPACE_CONTAINER 2025-05-07T20:27:13.8829153Z #define LONG_LONG_MIN (-LONG_LONG_MAX - 1LL) 2025-05-07T20:27:13.8829446Z #define __MATH_DECLARE_LDOUBLE 1 2025-05-07T20:27:13.8829802Z #define __managed__ __location__(managed) 2025-05-07T20:27:13.8830093Z #define _POSIX2_EXPR_NEST_MAX 32 2025-05-07T20:27:13.8830483Z #define __GNUC_PREREQ(maj,min) ((__GNUC__ << 16) + __GNUC_MINOR__ >= ((maj) << 16) + (min)) 2025-05-07T20:27:13.8830897Z #define _POSIX_STREAM_MAX 8 2025-05-07T20:27:13.8831151Z #define __LIBRARY_TYPES_H__ 2025-05-07T20:27:13.8831511Z #define _GLIBCXX_END_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_END_NAMESPACE_CXX11 2025-05-07T20:27:13.8831914Z #define __FLT32_MANT_DIG__ 24 2025-05-07T20:27:13.8832155Z #define _SYS_SIZE_T_H 2025-05-07T20:27:13.8832543Z #define _PSTL_VERSION_MINOR ((_PSTL_VERSION % 1000) / 10) 2025-05-07T20:27:13.8832869Z #define _GLIBCXX_STDLIB_H 1 2025-05-07T20:27:13.8833150Z #define isupper_l(c,l) __isupper_l ((c), (l)) 2025-05-07T20:27:13.8833437Z #define _CRTIMP 2025-05-07T20:27:13.8833652Z #define _GLIBCXX_CXX_CONFIG_H 1 2025-05-07T20:27:13.8833949Z #define __FLOAT_WORD_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:27:13.8834268Z #define STA_PPSJITTER 0x0200 2025-05-07T20:27:13.8834607Z #define _IO_feof_unlocked(__fp) (((__fp)->_flags & _IO_EOF_SEEN) != 0) 2025-05-07T20:27:13.8835016Z #define __SUSECONDS_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:27:13.8835448Z #define _GLIBCXX_HAVE_ISINFF 1 2025-05-07T20:27:13.8835714Z #define __glibcxx_requires_subscript(_N) 2025-05-07T20:27:13.8835991Z #define __SIZE_T__ 2025-05-07T20:27:13.8836201Z #define __stub_gtty 2025-05-07T20:27:13.8836414Z #define __pid_t_defined 2025-05-07T20:27:13.8836682Z #define _GLIBCXX_FWDREF(_Tp) _Tp&& 2025-05-07T20:27:13.8836987Z #define __NLINK_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:27:13.8837301Z #define __glibcxx_function_requires(...) 
2025-05-07T20:27:13.8837578Z #define __SM_80_RT_HPP__ 2025-05-07T20:27:13.8837818Z #define __need_clockid_t 2025-05-07T20:27:13.8838063Z #define SSIZE_MAX LONG_MAX 2025-05-07T20:27:13.8838307Z #define _GLIBCXX_HAVE_USELOCALE 1 2025-05-07T20:27:13.8838619Z #define __glibcxx_requires_string_len(_String,_Len) 2025-05-07T20:27:13.8838927Z #define _IO_HEX 0100 2025-05-07T20:27:13.8839170Z #define __NFDBITS (8 * (int) sizeof (__fd_mask)) 2025-05-07T20:27:13.8839502Z #define cudaExternalMemoryDedicated 0x1 2025-05-07T20:27:13.8839602Z #define _GLIBCXX_HAVE_TGMATH_H 1 2025-05-07T20:27:13.8839703Z #define _GLIBCXX11_USE_C99_COMPLEX 1 2025-05-07T20:27:13.8839919Z #define _GLIBCXX17_DEPRECATED_SUGGEST(ALT) _GLIBCXX_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:27:13.8840031Z #define ispunct_l(c,l) __ispunct_l ((c), (l)) 2025-05-07T20:27:13.8840517Z #define __cpp_aggregate_bases 201603L 2025-05-07T20:27:13.8840665Z #define __cudaGet_blockDim() blockDim 2025-05-07T20:27:13.8840784Z #define __cudaCDP2Memcpy3DAsync 2025-05-07T20:27:13.8840884Z #define __cudaCDP2MemcpyAsync 2025-05-07T20:27:13.8840963Z #define __stub_sstk 2025-05-07T20:27:13.8841058Z #define _IO_IN_BACKUP 0x100 2025-05-07T20:27:13.8841206Z #define _GLIBCXX_USE_C99_STDLIB _GLIBCXX11_USE_C99_STDLIB 2025-05-07T20:27:13.8841283Z #define __wur 2025-05-07T20:27:13.8841399Z #define isprint_l(c,l) __isprint_l ((c), (l)) 2025-05-07T20:27:13.8841482Z #define _G_HAVE_MMAP 1 2025-05-07T20:27:13.8841560Z #define _IO_OCT 040 2025-05-07T20:27:13.8841654Z #define __FLT128_HAS_DENORM__ 1 2025-05-07T20:27:13.8841746Z #define NL_MSGMAX INT_MAX 2025-05-07T20:27:13.8841833Z #define _GLIBCXX_USE_LFS 1 2025-05-07T20:27:13.8841962Z #define cudaDeviceScheduleBlockingSync 0x04 2025-05-07T20:27:13.8842049Z #define _POSIX_RTSIG_MAX 8 2025-05-07T20:27:13.8842151Z #define _GLIBCXX_NOEXCEPT noexcept 2025-05-07T20:27:13.8842336Z #define __glibcxx_requires_partitioned_lower(_First,_Last,_Value) 2025-05-07T20:27:13.8842432Z #define __FLT32_DECIMAL_DIG__ 9 2025-05-07T20:27:13.8842525Z #define _STL_ALGOBASE_H 1 2025-05-07T20:27:13.8842627Z #define __cudaCDP2MemsetAsync_ptsz 2025-05-07T20:27:13.8842711Z #define __off64_t_defined 2025-05-07T20:27:13.8842812Z #define _GLIBCXX_WEAK_DEFINITION 2025-05-07T20:27:13.8842898Z #define __FLT128_DIG__ 33 2025-05-07T20:27:13.8842997Z #define _GLIBCXX_USE_C99_INTTYPES_TR1 1 2025-05-07T20:27:13.8843094Z #define _GLIBCXX_HAVE_LOCALE_H 1 2025-05-07T20:27:13.8843173Z #define __INT32_C(c) c 2025-05-07T20:27:13.8843269Z #define __DEC64_EPSILON__ 1E-15DD 2025-05-07T20:27:13.8843365Z #define __ORDER_PDP_ENDIAN__ 3412 2025-05-07T20:27:13.8843457Z #define __DEC128_MIN_EXP__ (-6142) 2025-05-07T20:27:13.8843548Z #define __PDP_ENDIAN 3412 2025-05-07T20:27:13.8843632Z #define _ISOC95_SOURCE 1 2025-05-07T20:27:13.8843722Z #define _IO_fpos64_t _G_fpos64_t 2025-05-07T20:27:13.8843855Z #define M_PI_2l 1.570796326794896619231321691639751442L 2025-05-07T20:27:13.8844190Z #define BYTE_ORDER __BYTE_ORDER 2025-05-07T20:27:13.8844275Z #define __SM_90_RT_HPP__ 2025-05-07T20:27:13.8844375Z #define __INT_FAST32_TYPE__ long int 2025-05-07T20:27:13.8844468Z #define __have_pthread_attr_t 1 2025-05-07T20:27:13.8844563Z #define _GLIBCXX_HAVE_LIMIT_DATA 1 2025-05-07T20:27:13.8844784Z #define _GLIBCXX_BEGIN_NAMESPACE_LDBL_OR_CXX11 _GLIBCXX_BEGIN_NAMESPACE_CXX11 2025-05-07T20:27:13.8844889Z #define __cudaCDP2StreamWaitEvent 2025-05-07T20:27:13.8844995Z #define __cudaCDP2EventRecord 2025-05-07T20:27:13.8845084Z #define _BITS_TYPESIZES_H 1 2025-05-07T20:27:13.8845167Z #define 
htole32(x) (x) 2025-05-07T20:27:13.8845538Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessorWithFlags 2025-05-07T20:27:13.8845656Z #define __SYSCALL_SLONG_TYPE __SLONGWORD_TYPE 2025-05-07T20:27:13.8845750Z #define _GLIBCXX_USE_C99_MATH_TR1 1 2025-05-07T20:27:13.8845907Z #define WSTOPSIG(status) __WSTOPSIG (__WAIT_INT (status)) 2025-05-07T20:27:13.8846043Z #define _GLIBCXX_USE_C99_MATH _GLIBCXX11_USE_C99_MATH 2025-05-07T20:27:13.8846169Z #define __UINT_LEAST16_TYPE__ short unsigned int 2025-05-07T20:27:13.8846310Z #define __WIFEXITED(status) (__WTERMSIG(status) == 0) 2025-05-07T20:27:13.8846401Z #define ADJ_OFFSET 0x0001 2025-05-07T20:27:13.8846502Z #define cudaArrayLayered 0x01 2025-05-07T20:27:13.8846666Z #define _PSTL_ICC_18_OMP_SIMD_BROKEN (__INTEL_COMPILER == 1800) 2025-05-07T20:27:13.8846773Z #define cudaEventRecordDefault 0x00 2025-05-07T20:27:13.8846872Z #define _GLIBCXX_HAVE_FMODF 1 2025-05-07T20:27:13.8846968Z #define _PSTL_PRAGMA_MESSAGE(x) 2025-05-07T20:27:13.8847045Z #define unix 1 2025-05-07T20:27:13.8847151Z #define __DBL_HAS_DENORM__ 1 2025-05-07T20:27:13.8847241Z #define _POSIX_CHILD_MAX 25 2025-05-07T20:27:13.8847333Z #define _POSIX_MAX_INPUT 255 2025-05-07T20:27:13.8847455Z #define __cudaCDP2DeviceGetCacheConfig 2025-05-07T20:27:13.8847537Z #define __USE_POSIX 1 2025-05-07T20:27:13.8847627Z #define __FD_ZERO_STOS "stosq" 2025-05-07T20:27:13.8847764Z #define _PSTL_VERSION_MAJOR (_PSTL_VERSION / 1000) 2025-05-07T20:27:13.8847861Z #define __THROWNL throw () 2025-05-07T20:27:13.8847954Z #define __cpp_rtti 199711L 2025-05-07T20:27:13.8848055Z #define __SIZE_TYPE__ long unsigned int 2025-05-07T20:27:13.8848141Z #define __PMT(args) args 2025-05-07T20:27:13.8848257Z #define __UINT64_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:13.8848401Z #define __va_arg_pack_len() __builtin_va_arg_pack_len () 2025-05-07T20:27:13.8848510Z #define __ULONGWORD_TYPE unsigned long int 2025-05-07T20:27:13.8848605Z #define _SIZE_T_DECLARED 2025-05-07T20:27:13.8848699Z #define _PSTL_STRING_AUX(x) #x 2025-05-07T20:27:13.8848787Z #define __FLT_IS_IEC_60559__ 2 2025-05-07T20:27:13.8849180Z #define _PSTL_CPP14_MAKE_REVERSE_ITERATOR_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201402L || __cpp_lib_make_reverse_iterator == 201402) 2025-05-07T20:27:13.8849277Z #define _GLIBCXX_HAVE_LIMIT_AS 1 2025-05-07T20:27:13.8849374Z #define XATTR_LIST_MAX 65536 2025-05-07T20:27:13.8849464Z #define __CUDACC_VER_MAJOR__ 12 2025-05-07T20:27:13.8849610Z #define __GNUC_WIDE_EXECUTION_CHARSET_NAME "UTF-32LE" 2025-05-07T20:27:13.8849700Z #define _WCHAR_T_H 2025-05-07T20:27:13.8849787Z #define __FLT64X_DIG__ 18 2025-05-07T20:27:13.8849875Z #define _IO_SHOWBASE 0200 2025-05-07T20:27:13.8849967Z #define _POSIX_QLIMIT 1 2025-05-07T20:27:13.8850060Z #define __INT8_TYPE__ signed char 2025-05-07T20:27:13.8850154Z #define __SURFACE_TYPES_H__ 2025-05-07T20:27:13.8850247Z #define __CUDA_ARCH__ 520 2025-05-07T20:27:13.8850350Z #define __cpp_digit_separators 201309L 2025-05-07T20:27:13.8850429Z #define __ELF__ 1 2025-05-07T20:27:13.8850531Z #define CLOCK_THREAD_CPUTIME_ID 3 2025-05-07T20:27:13.8850628Z #define __GCC_ASM_FLAG_OUTPUTS__ 1 2025-05-07T20:27:13.8850715Z #define STA_INS 0x0010 2025-05-07T20:27:13.8850809Z #define __UINT32_TYPE__ unsigned int 2025-05-07T20:27:13.8850975Z #define _toupper(c) ((int) (*__ctype_toupper_loc ())[(int) (c)]) 2025-05-07T20:27:13.8851072Z #define _BITS_BYTESWAP_H 1 2025-05-07T20:27:13.8851163Z #define __ID_T_TYPE __U32_TYPE 2025-05-07T20:27:13.8851366Z #define __TIME_T_TYPE __SYSCALL_SLONG_TYPE 
2025-05-07T20:27:13.8851477Z #define __DEVICE_DOUBLE_FUNCTIONS_HPP__ 2025-05-07T20:27:13.8851571Z #define _GLIBCXX_HAVE_MBSTATE_T 1 2025-05-07T20:27:13.8851671Z #define __cpp_lib_logical_traits 201510 2025-05-07T20:27:13.8851770Z #define ADJ_OFFSET_SS_READ 0xa001 2025-05-07T20:27:13.8851920Z #define __warnattr(msg) __attribute__((__warning__ (msg))) 2025-05-07T20:27:13.8852079Z #define _PSTL_PRAGMA_LOCATION " [Parallel STL message]: " 2025-05-07T20:27:13.8852175Z #define _IO_funlockfile(_fp) 2025-05-07T20:27:13.8852568Z #define cudaKernelNodeAttributeAccessPolicyWindow cudaLaunchAttributeAccessPolicyWindow 2025-05-07T20:27:13.8852702Z #define M_2_PIl 0.636619772367581343075535053490057448L 2025-05-07T20:27:13.8852792Z #define __DRIVER_TYPES_H__ 2025-05-07T20:27:13.8852878Z #define __FLT_RADIX__ 2 2025-05-07T20:27:13.8852983Z #define __INT_LEAST16_TYPE__ short int 2025-05-07T20:27:13.8853145Z #define __LDBL_EPSILON__ 1.08420217248550443400745280086994171e-19L 2025-05-07T20:27:13.8853241Z #define __UINTMAX_C(c) c ## UL 2025-05-07T20:27:13.8853338Z #define _GLIBCXX_USE_LSTAT 1 2025-05-07T20:27:13.8853436Z #define minor(dev) gnu_dev_minor (dev) 2025-05-07T20:27:13.8853536Z #define _POSIX_C_SOURCE 200809L 2025-05-07T20:27:13.8853628Z #define _GLIBCXX_HAVE_DIRENT_H 1 2025-05-07T20:27:13.8853726Z #define __GLIBCXX_BITSIZE_INT_N_0 128 2025-05-07T20:27:13.8853813Z #define WORD_BIT 32 2025-05-07T20:27:13.8853898Z #define _IO_USER_BUF 1 2025-05-07T20:27:13.8853987Z #define __VECTOR_TYPES_H__ 2025-05-07T20:27:13.8854091Z #define __SM_20_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:27:13.8854203Z #define cudaHostAllocPortable 0x01 2025-05-07T20:27:13.8854303Z #define PTHREAD_STACK_MIN 16384 2025-05-07T20:27:13.8854403Z #define __long_double_t long double 2025-05-07T20:27:13.8854495Z #define _GLIBCXX_HAVE_ISINF 1 2025-05-07T20:27:13.8854584Z #define _POSIX_ARG_MAX 4096 2025-05-07T20:27:13.8854976Z #define cudaKernelNodeAttributeDeviceUpdatableKernelNode cudaLaunchAttributeDeviceUpdatableKernelNode 2025-05-07T20:27:13.8855061Z #define __k8 1 2025-05-07T20:27:13.8855256Z #define _GLIBCXX_NO_OBSOLETE_ISINF_ISNAN_DYNAMIC __GLIBC_PREREQ(2,23) 2025-05-07T20:27:13.8855422Z #define __FLT32X_MIN__ 2.22507385850720138309023271733240406e-308F32x 2025-05-07T20:27:13.8855534Z #define __LDBL_REDIR(name,proto) name proto 2025-05-07T20:27:13.8855635Z #define __SIG_ATOMIC_MAX__ 0x7fffffff 2025-05-07T20:27:13.8855728Z #define __SM_30_INTRINSICS_HPP__ 2025-05-07T20:27:13.8855822Z #define _GLIBCXX_EXTERN_TEMPLATE 1 2025-05-07T20:27:13.8855919Z #define __blksize_t_defined 2025-05-07T20:27:13.8856014Z #define _IO_SHOWPOINT 0400 2025-05-07T20:27:13.8856108Z #define _GLIBCXX_HAVE_LIMIT_RSS 1 2025-05-07T20:27:13.8856224Z #define cudaDeviceLmemResizeToMax 0x10 2025-05-07T20:27:13.8856313Z #define _GLIBCXX_X86_RDRAND 1 2025-05-07T20:27:13.8856421Z #define __GCC_ATOMIC_WCHAR_T_LOCK_FREE 2 2025-05-07T20:27:13.8856512Z #define _IO_IS_FILEBUF 0x2000 2025-05-07T20:27:13.8856604Z #define _GLIBCXX_USE_DUAL_ABI 1 2025-05-07T20:27:13.8856863Z #define __bswap_constant_16(x) ((unsigned short int) ((((x) >> 8) & 0xff) | (((x) & 0xff) << 8))) 2025-05-07T20:27:13.8857206Z #define cudaSignalExternalSemaphoresAsync __CUDART_API_PTSZ(cudaSignalExternalSemaphoresAsync_v2) 2025-05-07T20:27:13.8857302Z #define UCHAR_MAX (SCHAR_MAX * 2 + 1) 2025-05-07T20:27:13.8857404Z #define __SIZEOF_PTRDIFF_T__ 8 2025-05-07T20:27:13.8857484Z #define SEEK_SET 0 2025-05-07T20:27:13.8857579Z #define _GLIBCXX_TR1_GAMMA_TCC 1 2025-05-07T20:27:13.8857679Z #define 
__CUDA_API_VER_MINOR__ 8 2025-05-07T20:27:13.8857870Z #define _GLIBCXX_VISIBILITY(V) __attribute__ ((__visibility__ (#V))) 2025-05-07T20:27:13.8857978Z #define __cudaCDP2GetLastError 2025-05-07T20:27:13.8858070Z #define _GLIBCXX_HAVE_COSL 1 2025-05-07T20:27:13.8858157Z #define _MATH_H_MATHDEF 1 2025-05-07T20:27:13.8858576Z #define __bswap_constant_32(x) ((((x) & 0xff000000) >> 24) | (((x) & 0x00ff0000) >> 8) | (((x) & 0x0000ff00) << 8) | (((x) & 0x000000ff) << 24)) 2025-05-07T20:27:13.8858846Z #define _GLIBCXX_USE_FLOAT128 1 2025-05-07T20:27:13.8858985Z #define _IO_FLAGS2_NOTCANCEL 2 2025-05-07T20:27:13.8859119Z #define __stub_sigreturn 2025-05-07T20:27:13.8859444Z #define __errordecl(name,msg) extern void name (void) __attribute__((__error__ (msg))) 2025-05-07T20:27:13.8859540Z #define _GLIBCXX_HAVE_UTIME_H 1 2025-05-07T20:27:13.8859637Z #define __HOST_CONFIG_H__ 2025-05-07T20:27:13.8859734Z #define _XOPEN_SOURCE_EXTENDED 1 2025-05-07T20:27:13.8859826Z #define CLOCK_TAI 11 2025-05-07T20:27:13.8859927Z #define _GLIBCXX_END_NAMESPACE_VERSION 2025-05-07T20:27:13.8860212Z #define __glibcxx_requires_sorted_set_pred(_First1,_Last1,_First2,_Pred) 2025-05-07T20:27:13.8860308Z #define __restrict_arr 2025-05-07T20:27:13.8860417Z #define _PSTL_PRAGMA_MESSAGE_POLICIES(x) 2025-05-07T20:27:13.8860553Z #define __glibcxx_requires_valid_range(_First,_Last) 2025-05-07T20:27:13.8861095Z #define strndupa(s,n) (__extension__ ({ const char *__old = (s); size_t __len = strnlen (__old, (n)); char *__new = (char *) __builtin_alloca (__len + 1); __new[__len] = '\0'; (char *) memcpy (__new, __old, __len); })) 2025-05-07T20:27:13.8861281Z #define __attribute_artificial__ __attribute__ ((__artificial__)) 2025-05-07T20:27:13.8861371Z #define __USE_MISC 1 2025-05-07T20:27:13.8861469Z #define __UWORD_TYPE unsigned long int 2025-05-07T20:27:13.8861565Z #define _EXCEPTION_DEFINES_H 1 2025-05-07T20:27:13.8861654Z #define _GCC_LIMITS_H_ 2025-05-07T20:27:13.8861736Z #define __LDBL_DIG__ 18 2025-05-07T20:27:13.8861829Z #define __BIT_TYPES_DEFINED__ 1 2025-05-07T20:27:13.8861933Z #define __malloc_and_calloc_defined 2025-05-07T20:27:13.8862027Z #define __FLT64_IS_IEC_60559__ 2 2025-05-07T20:27:13.8862126Z #define _GLIBCXX_HAVE_SYS_SYSINFO_H 1 2025-05-07T20:27:13.8862214Z #define __x86_64__ 1 2025-05-07T20:27:13.8862291Z #define _SIZE_T_ 2025-05-07T20:27:13.8863151Z #define __bswap_constant_64(x) (__extension__ ((((x) & 0xff00000000000000ull) >> 56) | (((x) & 0x00ff000000000000ull) >> 40) | (((x) & 0x0000ff0000000000ull) >> 24) | (((x) & 0x000000ff00000000ull) >> 8) | (((x) & 0x00000000ff000000ull) << 8) | (((x) & 0x0000000000ff0000ull) << 24) | (((x) & 0x000000000000ff00ull) << 40) | (((x) & 0x00000000000000ffull) << 56))) 2025-05-07T20:27:13.8863256Z #define _POSIX2_COLL_WEIGHTS_MAX 2 2025-05-07T20:27:13.8863348Z #define __FLT32X_MIN_EXP__ (-1021) 2025-05-07T20:27:13.8863465Z #define __PTHREAD_RWLOCK_INT_FLAGS_SHARED 1 2025-05-07T20:27:13.8863579Z #define __DEC32_SUBNORMAL_MIN__ 0.000001E-95DF 2025-05-07T20:27:13.8863671Z #define _IO_iconv_t _G_iconv_t 2025-05-07T20:27:13.8863781Z #define _GLIBCXX_FLOAT_IS_IEEE_BINARY32 1 2025-05-07T20:27:13.8863902Z #define __cpp_lib_make_reverse_iterator 201402 2025-05-07T20:27:13.8864036Z #define _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(A) 2025-05-07T20:27:13.8864136Z #define _GLIBCXX_HAVE_DLFCN_H 1 2025-05-07T20:27:13.8864586Z #define strdupa(s) (__extension__ ({ const char *__old = (s); size_t __len = strlen (__old) + 1; char *__new = (char *) __builtin_alloca (__len); (char *) memcpy 
(__new, __old, __len); })) 2025-05-07T20:27:13.8864714Z #define __no_return__ __attribute__((noreturn)) 2025-05-07T20:27:13.8864854Z #define __device_builtin__ __location__(device_builtin) 2025-05-07T20:27:13.8864949Z #define _PSTL_HIDE_FROM_ABI_POP 2025-05-07T20:27:13.8865047Z #define _GLIBCXX_HAVE_ACOSF 1 2025-05-07T20:27:13.8865131Z #define STA_FLL 0x0008 2025-05-07T20:27:13.8865268Z #define _GLIBCXX_HAVE_BUILTIN_IS_CONSTANT_EVALUATED 1 2025-05-07T20:27:13.8865364Z #define _GLIBCXX_END_EXTERN_C } 2025-05-07T20:27:13.8865480Z #define __INT_FAST16_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:13.8865595Z #define __cpp_lib_integer_sequence 201304 2025-05-07T20:27:13.8865677Z #define __stub_revoke 2025-05-07T20:27:13.8865763Z #define __timer_t_defined 1 2025-05-07T20:27:13.8865898Z #define _GLIBCXX11_DEPRECATED _GLIBCXX_DEPRECATED 2025-05-07T20:27:13.8865987Z #define INT_MAX __INT_MAX__ 2025-05-07T20:27:13.8866087Z #define ULLONG_MAX (LLONG_MAX * 2ULL + 1) 2025-05-07T20:27:13.8866193Z #define _GLIBCXX_END_NAMESPACE_CXX11 } 2025-05-07T20:27:13.8866411Z #define _GLIBCXX_ICONV_CONST 2025-05-07T20:27:13.8866506Z #define major(dev) gnu_dev_major (dev) 2025-05-07T20:27:13.8866619Z #define cudaArrayTextureGather 0x08 2025-05-07T20:27:13.8866715Z #define _GLIBCXX_LT_OBJDIR ".libs/" 2025-05-07T20:27:13.8866857Z #define __inline_hint__ __attribute__((nv_inline_hint)) 2025-05-07T20:27:13.8866952Z #define __NV_LEGACY_LAUNCH 1 2025-05-07T20:27:13.8867037Z #define _IO_off_t __off_t 2025-05-07T20:27:13.8867129Z #define __FLT64_DIG__ 15 2025-05-07T20:27:13.8867348Z #define PTHREAD_DESTRUCTOR_ITERATIONS _POSIX_THREAD_DESTRUCTOR_ITERATIONS 2025-05-07T20:27:13.8867515Z #define _POSIX2_LINE_MAX 2048 2025-05-07T20:27:13.8867778Z #define __UINT_FAST32_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:13.8867899Z #define __UINT_LEAST64_TYPE__ long unsigned int 2025-05-07T20:27:13.8867991Z #define ADJ_FREQUENCY 0x0002 2025-05-07T20:27:13.8868095Z #define __CUDART_API_PTDS(api) api 2025-05-07T20:27:13.8868177Z #define NULL __null 2025-05-07T20:27:13.8868308Z #define cudaStreamPerThread ((cudaStream_t)0x2) 2025-05-07T20:27:13.8868412Z #define _GLIBCXX_CONSTEXPR constexpr 2025-05-07T20:27:13.8868506Z #define __U64_TYPE unsigned long int 2025-05-07T20:27:13.8868604Z #define __FLT_HAS_QUIET_NAN__ 1 2025-05-07T20:27:13.8868695Z #define __FLT_MAX_10_EXP__ 38 2025-05-07T20:27:13.8868771Z #define FP_ZERO 2 2025-05-07T20:27:13.8868866Z #define _GLIBCXX_HAVE_FLOORL 1 2025-05-07T20:27:13.8869015Z #define __isgraph_l(c,l) __isctype_l((c), _ISgraph, (l)) 2025-05-07T20:27:13.8869117Z #define __LONG_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:13.8869203Z #define __WCHAR_T__ 2025-05-07T20:27:13.8869299Z #define __FLT64X_HAS_DENORM__ 1 2025-05-07T20:27:13.8869487Z #define __DEC128_SUBNORMAL_MIN__ 0.000000000000000000000000000000001E-6143DL 2025-05-07T20:27:13.8869639Z #define _GLIBCXX_NORETURN __attribute__ ((__noreturn__)) 2025-05-07T20:27:13.8869731Z #define __FLT_HAS_INFINITY__ 1 2025-05-07T20:27:13.8869852Z #define __GNUC_EXECUTION_CHARSET_NAME "UTF-8" 2025-05-07T20:27:13.8869967Z #define _GLIBCXX20_DEPRECATED_SUGGEST(ALT) 2025-05-07T20:27:13.8870091Z #define __WSTOPSIG(status) __WEXITSTATUS(status) 2025-05-07T20:27:13.8870220Z #define cudaSurfaceTypeCubemapLayered 0xFC 2025-05-07T20:27:13.8870308Z #define _BSD_PTRDIFF_T_ 2025-05-07T20:27:13.8870395Z #define _SIGSET_H_types 1 2025-05-07T20:27:13.8870528Z #define cudaTextureType1DLayered 0xF1 2025-05-07T20:27:13.8870732Z #define __cpp_unicode_literals 200710L 2025-05-07T20:27:13.8870907Z 
#define __isdigit_l(c,l) __isctype_l((c), _ISdigit, (l)) 2025-05-07T20:27:13.8871072Z #define __LONG_LONG_PAIR(HI,LO) LO, HI 2025-05-07T20:27:13.8871214Z #define __UINT_FAST16_TYPE__ long unsigned int 2025-05-07T20:27:13.8879918Z #define __bos0(ptr) __builtin_object_size (ptr, 0) 2025-05-07T20:27:13.8880042Z #define __DEC64_MAX__ 9.999999999999999E384DD 2025-05-07T20:27:13.8880170Z #define M_1_PIl 0.318309886183790671537767526745028724L 2025-05-07T20:27:13.8880279Z #define __CUDACC_DEVICE_ATOMIC_BUILTINS__ 1 2025-05-07T20:27:13.8880476Z #define WIFSTOPPED(status) __WIFSTOPPED (__WAIT_INT (status)) 2025-05-07T20:27:13.8880574Z #define __INT_FAST32_WIDTH__ 64 2025-05-07T20:27:13.8880679Z #define _POSIX2_CHARCLASS_NAME_MAX 14 2025-05-07T20:27:13.8880785Z #define _GLIBCXX_BITS_STD_ABS_H 2025-05-07T20:27:13.8880873Z #define STA_MODE 0x4000 2025-05-07T20:27:13.8880981Z #define __CHAR16_TYPE__ short unsigned int 2025-05-07T20:27:13.8881097Z #define __PRAGMA_REDEFINE_EXTNAME 1 2025-05-07T20:27:13.8881210Z #define __glibcxx_signed_b(T,B) ((T)(-1) < 0) 2025-05-07T20:27:13.8881318Z #define __USING_NAMESPACE_C99(name) 2025-05-07T20:27:13.8881416Z #define BIG_ENDIAN __BIG_ENDIAN 2025-05-07T20:27:13.8881521Z #define __cudaCDP2EventRecord_ptsz 2025-05-07T20:27:13.8881626Z #define _GLIBCXX_HAVE_SINL 1 2025-05-07T20:27:13.8881739Z #define EXPR_NEST_MAX _POSIX2_EXPR_NEST_MAX 2025-05-07T20:27:13.8881828Z #define __SIZE_WIDTH__ 64 2025-05-07T20:27:13.8881949Z #define __BLKSIZE_T_TYPE __SYSCALL_SLONG_TYPE 2025-05-07T20:27:13.8882029Z #define __SEG_FS 1 2025-05-07T20:27:13.8882260Z #define _IO_size_t size_t 2025-05-07T20:27:13.8882360Z #define __INT_LEAST16_MAX__ 0x7fff 2025-05-07T20:27:13.8882456Z #define INT_MIN (-INT_MAX - 1) 2025-05-07T20:27:13.8882540Z #define __stub_lchmod 2025-05-07T20:27:13.8882638Z #define __DEC64_MANT_DIG__ 16 2025-05-07T20:27:13.8882742Z #define __INT64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:13.8882847Z #define _GLIBCXX_MANGLE_SIZE_T m 2025-05-07T20:27:13.8882926Z #define __SEG_GS 1 2025-05-07T20:27:13.8883108Z #define __FLT32_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F32 2025-05-07T20:27:13.8883199Z #define _IOS_APPEND 8 2025-05-07T20:27:13.8883433Z #define __SIG_ATOMIC_WIDTH__ 32 2025-05-07T20:27:13.8883525Z #define _GLIBCXX_RELEASE 11 2025-05-07T20:27:13.8883626Z #define _GLIBCXX98_USE_C99_WCHAR 1 2025-05-07T20:27:13.8883720Z #define _IO_IS_APPENDING 0x1000 2025-05-07T20:27:13.8883815Z #define __INT_LEAST64_TYPE__ long int 2025-05-07T20:27:13.8883905Z #define htole16(x) (x) 2025-05-07T20:27:13.8884017Z #define __TEXTURE_INDIRECT_FUNCTIONS_H__ 2025-05-07T20:27:13.8884108Z #define _GLIBCXX_HAVE_FCNTL_H 1 2025-05-07T20:27:13.8884207Z #define __INT16_TYPE__ short int 2025-05-07T20:27:13.8884308Z #define __INT_LEAST8_TYPE__ signed char 2025-05-07T20:27:13.8884415Z #define __glibcxx_class_requires(_a,_b) 2025-05-07T20:27:13.8884521Z #define __cpp_structured_bindings 201606L 2025-05-07T20:27:13.8884642Z #define __align__(n) __attribute__((aligned(n))) 2025-05-07T20:27:13.8884734Z #define __SIZEOF_INT__ 4 2025-05-07T20:27:13.8884821Z #define __WCLONE 0x80000000 2025-05-07T20:27:13.8884911Z #define __DEC32_MAX_EXP__ 97 2025-05-07T20:27:13.8885006Z #define SEEK_HOLE 4 2025-05-07T20:27:13.8885091Z #define TIMER_ABSTIME 1 2025-05-07T20:27:13.8885181Z #define __INT_FAST8_MAX__ 0x7f 2025-05-07T20:27:13.8885274Z #define __CUDA_MATH_CRTIMP 2025-05-07T20:27:13.8885455Z #define __FLT128_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:27:13.8885597Z #define 
__INTPTR_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:13.8885735Z #define __DRIVER_FUNCTIONS_H__ 2025-05-07T20:27:13.8885874Z #define __cpp_sized_deallocation 201309L 2025-05-07T20:27:13.8886009Z #define __MATH_FUNCTIONS_HPP__ 2025-05-07T20:27:13.8886138Z #define __cpp_guaranteed_copy_elision 201606L 2025-05-07T20:27:13.8886223Z #define _LINUX_LIMITS_H 2025-05-07T20:27:13.8886311Z #define linux 1 2025-05-07T20:27:13.8886397Z #define MOD_MICRO ADJ_MICRO 2025-05-07T20:27:13.8886504Z #define _GLIBCXX_DEBUG_ASSERT(_Condition) 2025-05-07T20:27:13.8886604Z #define _GLIBCXX_HAVE_VSWSCANF 1 2025-05-07T20:27:13.8886696Z #define _GLIBCXX_HAVE_ISNAN 1 2025-05-07T20:27:13.8886804Z #define _XOPEN_IOV_MAX _POSIX_UIO_MAXIOV 2025-05-07T20:27:13.8886951Z #define __cudart_builtin__ __location__(cudart_builtin) 2025-05-07T20:27:13.8887045Z #define __cpp_lib_hypot 201603 2025-05-07T20:27:13.8887145Z #define __FLT64_HAS_QUIET_NAN__ 1 2025-05-07T20:27:13.8887239Z #define _GLIBCXX_HAVE_WCTYPE_H 1 2025-05-07T20:27:13.8887326Z #define MOD_NANO ADJ_NANO 2025-05-07T20:27:13.8887418Z #define htole64(x) (x) 2025-05-07T20:27:13.8887514Z #define FP_ILOGBNAN (-2147483647 - 1) 2025-05-07T20:27:13.8887636Z #define _IO_stdout ((_IO_FILE*)(&_IO_2_1_stdout_)) 2025-05-07T20:27:13.8887734Z #define _IO_UPPERCASE 01000 2025-05-07T20:27:13.8888381Z #define cudaKernelNodeAttributeClusterSchedulingPolicyPreference cudaLaunchAttributeClusterSchedulingPolicyPreference 2025-05-07T20:27:13.8888506Z #define __USE_POSIX2 1 2025-05-07T20:27:13.8888659Z #define MOD_ESTERROR ADJ_ESTERROR 2025-05-07T20:27:13.8888786Z #define __WALL 0x40000000 2025-05-07T20:27:13.8888897Z #define _GLIBCXX_HAVE_LDEXPF 1 2025-05-07T20:27:13.8888991Z #define _XLOCALE_H 1 2025-05-07T20:27:13.8889084Z #define _GLIBCXX_USE_TMPNAM 1 2025-05-07T20:27:13.8889185Z #define __FLT32_MIN_10_EXP__ (-37) 2025-05-07T20:27:13.8889278Z #define __KEY_T_TYPE __S32_TYPE 2025-05-07T20:27:13.8889381Z #define __cudaGet_threadIdx() threadIdx 2025-05-07T20:27:13.8889475Z #define __EXCEPTIONS 1 2025-05-07T20:27:13.8889574Z #define __CUDART_API_PTSZ(api) api 2025-05-07T20:27:13.8889869Z #define __launch_bounds__(...) 
__annotate__(launch_bounds(__VA_ARGS__)) 2025-05-07T20:27:13.8889960Z #define __WORDSIZE 64 2025-05-07T20:27:13.8890049Z #define CLOCK_MONOTONIC 1 2025-05-07T20:27:13.8890135Z #define _STL_RELOPS_H 1 2025-05-07T20:27:13.8890234Z #define __PTRDIFF_WIDTH__ 64 2025-05-07T20:27:13.8890328Z #define __BEGIN_DECLS extern "C" { 2025-05-07T20:27:13.8890424Z #define _GLIBCXX_HAVE_SYS_IPC_H 1 2025-05-07T20:27:13.8890528Z #define __LDBL_MANT_DIG__ 64 2025-05-07T20:27:13.8890625Z #define _GLIBCXX_HAVE_TRUNCATE 1 2025-05-07T20:27:13.8891012Z #define cudaKernelNodeAttributeClusterDimension cudaLaunchAttributeClusterDimension 2025-05-07T20:27:13.8891243Z #define _PSTL_GCC_VERSION (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) 2025-05-07T20:27:13.8891370Z #define _GLIBCXX_NAMESPACE_CXX11 __cxx11:: 2025-05-07T20:27:13.8891474Z #define _GLIBCXX_NUMERIC_LIMITS 1 2025-05-07T20:27:13.8891574Z #define __cpp_range_based_for 201603L 2025-05-07T20:27:13.8891687Z #define __cpp_lib_exchange_function 201304 2025-05-07T20:27:13.8891791Z #define _GLIBCXX_HAVE_INTTYPES_H 1 2025-05-07T20:27:13.8891894Z #define _GLIBCXX_DARWIN_USE_64_BIT_INODE 1 2025-05-07T20:27:13.8892079Z #define cudaCooperativeLaunchMultiDeviceNoPostSync 0x02 2025-05-07T20:27:13.8892175Z #define __FLT64_HAS_INFINITY__ 1 2025-05-07T20:27:13.8892264Z #define _GLIBCXX_CSTDLIB 1 2025-05-07T20:27:13.8892371Z #define _GLIBCXX_DEBUG_MACRO_SWITCH_H 1 2025-05-07T20:27:13.8892542Z #define __FLT64X_MAX__ 1.18973149535723176502126385303097021e+4932F64x 2025-05-07T20:27:13.8892653Z #define __STDCPP_DEFAULT_NEW_ALIGNMENT__ 16 2025-05-07T20:27:13.8892748Z #define _STRING_H 1 2025-05-07T20:27:13.8892844Z #define _BITS_PTHREADTYPES_H 1 2025-05-07T20:27:13.8892931Z #define _GCC_MAX_ALIGN_T 2025-05-07T20:27:13.8893036Z #define __SM_32_INTRINSICS_HPP__ 2025-05-07T20:27:13.8893167Z #define __SIG_ATOMIC_MIN__ (-__SIG_ATOMIC_MAX__ - 1) 2025-05-07T20:27:13.8893258Z #define __code_model_small__ 1 2025-05-07T20:27:13.8893357Z #define _PSTL_CONFIG_H 2025-05-07T20:27:13.8893454Z #define __GCC_ATOMIC_LONG_LOCK_FREE 2 2025-05-07T20:27:13.8893575Z #define __cpp_nontype_template_args 201411L 2025-05-07T20:27:13.8893665Z #define __SM_20_INTRINSICS_H__ 2025-05-07T20:27:13.8893763Z #define cudaCpuDeviceId ((int)-1) 2025-05-07T20:27:13.8894097Z #define assert(expr) ((expr) ? 
__ASSERT_VOID_CAST (0) : __assert_fail (__STRING(expr), __FILE__, __LINE__, __ASSERT_FUNCTION)) 2025-05-07T20:27:13.8894188Z #define __DEC32_MANT_DIG__ 7 2025-05-07T20:27:13.8894270Z #define le64toh(x) (x) 2025-05-07T20:27:13.8894367Z #define FILENAME_MAX 4096 2025-05-07T20:27:13.8894515Z #define __iscntrl_l(c,l) __isctype_l((c), _IScntrl, (l)) 2025-05-07T20:27:13.8894625Z #define __cpp_return_type_deduction 201304L 2025-05-07T20:27:13.8894712Z #define L_cuserid 9 2025-05-07T20:27:13.8894797Z #define __ino_t_defined 2025-05-07T20:27:13.8894880Z #define __k8__ 1 2025-05-07T20:27:13.8894976Z #define __INTPTR_TYPE__ long int 2025-05-07T20:27:13.8895079Z #define __UINT16_TYPE__ short unsigned int 2025-05-07T20:27:13.8895178Z #define __int8_t_defined 2025-05-07T20:27:13.8895266Z #define __WCHAR_TYPE__ int 2025-05-07T20:27:13.8895361Z #define __CLOCKID_T_TYPE __S32_TYPE 2025-05-07T20:27:13.8895477Z #define cudaHostRegisterPortable 0x01 2025-05-07T20:27:13.8895570Z #define __SLONGWORD_TYPE long int 2025-05-07T20:27:13.8895683Z #define _GLIBCXX_PACKAGE_TARNAME "libstdc++" 2025-05-07T20:27:13.8895834Z #define __isblank_l(c,l) __isctype_l((c), _ISblank, (l)) 2025-05-07T20:27:13.8895919Z #define __HAVE_COLUMN 2025-05-07T20:27:13.8896001Z #define __stub_fdetach 2025-05-07T20:27:13.8896407Z #define __CUDACC_VER__ "__CUDACC_VER__ is no longer supported. Use __CUDACC_VER_MAJOR__, __CUDACC_VER_MINOR__, and __CUDACC_VER_BUILD__ instead." 2025-05-07T20:27:13.8896486Z #define __pic__ 2 2025-05-07T20:27:13.8896615Z #define __UINTPTR_MAX__ 0xffffffffffffffffUL 2025-05-07T20:27:13.8896709Z #define CLOCKS_PER_SEC 1000000l 2025-05-07T20:27:13.8896797Z #define __INT_FAST64_WIDTH__ 64 2025-05-07T20:27:13.8896988Z #define _GLIBCXX_HAVE_SOCKATMARK 1 2025-05-07T20:27:13.8897070Z #define __stub_chflags 2025-05-07T20:27:13.8897154Z #define CLOCK_BOOTTIME 7 2025-05-07T20:27:13.8897247Z #define __need_IOV_MAX 2025-05-07T20:27:13.8897351Z #define putc(_ch,_fp) _IO_putc (_ch, _fp) 2025-05-07T20:27:13.8897449Z #define __UQUAD_TYPE unsigned long int 2025-05-07T20:27:13.8897550Z #define __cpp_decltype 200707L 2025-05-07T20:27:13.8897644Z #define __BYTE_ORDER __LITTLE_ENDIAN 2025-05-07T20:27:13.8897732Z #define _GLIBCXX_USE_C99 1 2025-05-07T20:27:13.8897840Z #define _GLIBCXX_TR1_BETA_FUNCTION_TCC 1 2025-05-07T20:27:13.8897995Z #define TTY_NAME_MAX 32 2025-05-07T20:27:13.8898165Z #define _GLIBCXX_FORWARD(_Tp,__val) std::forward<_Tp>(__val) 2025-05-07T20:27:13.8898296Z #define __INT_FAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:13.8898482Z #define _PSTL_ASSERT(_Condition) __glibcxx_assert(_Condition) 2025-05-07T20:27:13.8898607Z #define __GCC_ATOMIC_TEST_AND_SET_TRUEVAL 1 2025-05-07T20:27:13.8898703Z #define __LITTLE_ENDIAN 1234 2025-05-07T20:27:13.8898792Z #define STA_PPSTIME 0x0004 2025-05-07T20:27:13.8898878Z #define __import__ 2025-05-07T20:27:13.8898963Z #define BUFSIZ _IO_BUFSIZ 2025-05-07T20:27:13.8899094Z #define M_SQRT2l 1.414213562373095048801688724209698079L 2025-05-07T20:27:13.8899183Z #define __export__ 2025-05-07T20:27:13.8899297Z #define __FSID_T_TYPE struct { int __val[2]; } 2025-05-07T20:27:13.8899394Z #define cudaMemAttachHost 0x02 2025-05-07T20:27:13.8899556Z #define __FLT_NORM_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:27:13.8899649Z #define _GLIBCXX_HAVE_ICONV 1 2025-05-07T20:27:13.8899747Z #define _GLIBCXX_SYMVER 1 2025-05-07T20:27:13.8899838Z #define __FLT64X_MAX_EXP__ 16384 2025-05-07T20:27:13.8899925Z #define _WCHAR_T_DECLARED 2025-05-07T20:27:13.8900050Z #define 
__UINT_FAST64_TYPE__ long unsigned int 2025-05-07T20:27:13.8900162Z #define isalpha_l(c,l) __isalpha_l ((c), (l)) 2025-05-07T20:27:13.8900265Z #define __cpp_inline_variables 201606L 2025-05-07T20:27:13.8900363Z #define WNOWAIT 0x01000000 2025-05-07T20:27:13.8900444Z #define PLOSS 6 2025-05-07T20:27:13.8900532Z #define M_LN10 2.30258509299404568402 2025-05-07T20:27:13.8900791Z #define _PSTL_UDS_PRESENT (__INTEL_COMPILER >= 1900 && __INTEL_COMPILER_BUILD_DATE >= 20180626) 2025-05-07T20:27:13.8900875Z #define EXIT_SUCCESS 0 2025-05-07T20:27:13.8900972Z #define __LDBL_REDIR_DECL(name) 2025-05-07T20:27:13.8901063Z #define _GLIBCXX_HAVE_STRTOF 1 2025-05-07T20:27:13.8901160Z #define MOD_FREQUENCY ADJ_FREQUENCY 2025-05-07T20:27:13.8901253Z #define __thread__ __thread 2025-05-07T20:27:13.8901345Z #define _GLIBCXX_HAVE_MEMORY_H 1 2025-05-07T20:27:13.8901437Z #define __INT_MAX__ 0x7fffffff 2025-05-07T20:27:13.8901540Z #define __SIZEOF_PTHREAD_BARRIER_T 32 2025-05-07T20:27:13.8901759Z #define __glibcxx_requires_partitioned_upper_pred(_First,_Last,_Value,_Pred) 2025-05-07T20:27:13.8901868Z #define __cudaCDP2StreamWaitEvent_ptsz 2025-05-07T20:27:13.8901963Z #define _GLIBCXX_HAVE_SINF 1 2025-05-07T20:27:13.8902044Z #define __linux__ 1 2025-05-07T20:27:13.8902136Z #define STA_PPSSIGNAL 0x0100 2025-05-07T20:27:13.8902265Z #define M_LN2l 0.693147180559945309417232121458176568L 2025-05-07T20:27:13.8902352Z #define __S16_TYPE short int 2025-05-07T20:27:13.8902693Z #define __glibcxx_constexpr_assert(cond) if (__builtin_is_constant_evaluated() && !bool(cond)) __builtin_unreachable() 2025-05-07T20:27:13.8902793Z #define __NVCC_DIAG_PRAGMA_SUPPORT__ 1 2025-05-07T20:27:13.8902976Z #define __bos(ptr) __builtin_object_size (ptr, __USE_FORTIFY_LEVEL > 1) 2025-05-07T20:27:13.8903075Z #define __COMMON_FUNCTIONS_H__ 2025-05-07T20:27:13.8903172Z #define UINT_MAX (INT_MAX * 2U + 1U) 2025-05-07T20:27:13.8903252Z #define _T_SIZE_ 2025-05-07T20:27:13.8903351Z #define LLONG_MAX __LONG_LONG_MAX__ 2025-05-07T20:27:13.8903466Z #define __cudaCDP2StreamCreateWithFlags 2025-05-07T20:27:13.8903556Z #define _PSTL_VERSION 12000 2025-05-07T20:27:13.8903677Z #define __noinline__ __attribute__((noinline)) 2025-05-07T20:27:13.8903768Z #define __WNOTHREAD 0x20000000 2025-05-07T20:27:13.8903959Z #define _G_va_list __gnuc_va_list 2025-05-07T20:27:13.8904084Z #define M_PI_4l 0.785398163397448309615660845819875721L 2025-05-07T20:27:13.8904164Z #define _IOS_INPUT 1 2025-05-07T20:27:13.8904258Z #define __USE_LARGEFILE64 1 2025-05-07T20:27:13.8904357Z #define _GLIBCXX_TR1_EXP_INTEGRAL_TCC 1 2025-05-07T20:27:13.8904444Z #define __INT64_TYPE__ long int 2025-05-07T20:27:13.8904541Z #define _POSIX_SSIZE_MAX 32767 2025-05-07T20:27:13.8904635Z #define __shared__ __location__(shared) 2025-05-07T20:27:13.8904722Z #define __FLT_MAX_EXP__ 128 2025-05-07T20:27:13.8904958Z #define __glibc_unlikely(cond) __builtin_expect((cond), 0) 2025-05-07T20:27:13.8905045Z #define __gid_t_defined 2025-05-07T20:27:13.8905160Z #define _GLIBCXX_USE_SC_NPROCESSORS_ONLN 1 2025-05-07T20:27:13.8905252Z #define __ORDER_BIG_ENDIAN__ 4321 2025-05-07T20:27:13.8905448Z #define __glibcxx_requires_can_increment_range(_First1,_Last1,_First2) 2025-05-07T20:27:13.8905548Z #define _GLIBCXX17_INLINE inline 2025-05-07T20:27:13.8905638Z #define __DBL_MANT_DIG__ 53 2025-05-07T20:27:13.8905722Z #define ___int_size_t_h 2025-05-07T20:27:13.8905826Z #define __FSBLKCNT64_T_TYPE __UQUAD_TYPE 2025-05-07T20:27:13.8905946Z #define __cpp_inheriting_constructors 201511L 2025-05-07T20:27:13.8906094Z 
#define __WIFCONTINUED(status) ((status) == __W_CONTINUED) 2025-05-07T20:27:13.8906199Z #define CUDA_DOUBLE_MATH_FUNCTIONS 1 2025-05-07T20:27:13.8906289Z #define _GLIBCXX_HAVE_FENV_H 1 2025-05-07T20:27:13.8906379Z #define _GLIBCXX_HAVE_STDBOOL_H 1 2025-05-07T20:27:13.8906475Z #define __SIZEOF_FLOAT128__ 16 2025-05-07T20:27:13.8906599Z #define __INT_LEAST64_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:13.8906715Z #define _GLIBCXX_TR1_HYPERGEOMETRIC_TCC 1 2025-05-07T20:27:13.8906829Z #define _GLIBCXX_DEBUG_PEDASSERT(_Condition) 2025-05-07T20:27:13.8906917Z #define __clock_t_defined 1 2025-05-07T20:27:13.8907018Z #define _POSIX_SEM_VALUE_MAX 32767 2025-05-07T20:27:13.8907124Z #define __cudaCDP2RuntimeGetVersion 2025-05-07T20:27:13.8907215Z #define __GLIBC_MINOR__ 17 2025-05-07T20:27:13.8907308Z #define __DEC64_MIN__ 1E-383DD 2025-05-07T20:27:13.8907402Z #define __WINT_TYPE__ unsigned int 2025-05-07T20:27:13.8907505Z #define __UINT_LEAST32_TYPE__ unsigned int 2025-05-07T20:27:13.8907736Z #define __SIZEOF_SHORT__ 2 2025-05-07T20:27:13.8907906Z #define __FLT32_NORM_MAX__ 3.40282346638528859811704183484516925e+38F32 2025-05-07T20:27:13.8907989Z #define __SSE__ 1 2025-05-07T20:27:13.8908081Z #define SEM_VALUE_MAX (2147483647) 2025-05-07T20:27:13.8908171Z #define M_SQRT1_2 0.70710678118654752440 2025-05-07T20:27:13.8908257Z #define _CTYPE_H 1 2025-05-07T20:27:13.8908351Z #define __sigset_t_defined 2025-05-07T20:27:13.8908443Z #define __LDBL_MIN_EXP__ (-16381) 2025-05-07T20:27:13.8908537Z #define _GLIBCXX_HAVE_LOGF 1 2025-05-07T20:27:13.8908620Z #define MOD_TAI ADJ_TAI 2025-05-07T20:27:13.8908714Z #define _IO_va_list __gnuc_va_list 2025-05-07T20:27:13.8908809Z #define _GLIBCXX_HAVE_LOGL 1 2025-05-07T20:27:13.8908889Z #define __SM_70_RT_H__ 2025-05-07T20:27:13.8908985Z #define _GLIBCXX_HAVE_WRITEV 1 2025-05-07T20:27:13.8909092Z #define cudaEventWaitDefault 0x00 2025-05-07T20:27:13.8909183Z #define _GLIBCXX_HAVE_EXPL 1 2025-05-07T20:27:13.8909343Z #define __FLT64_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:27:13.8909434Z #define _POSIX_MAX_CANON 255 2025-05-07T20:27:13.8909537Z #define _GLIBCXX_NOEXCEPT_PARM , bool _NE 2025-05-07T20:27:13.8909633Z #define FD_SETSIZE __FD_SETSIZE 2025-05-07T20:27:13.8909720Z #define _GLIBCXX_TXN_SAFE 2025-05-07T20:27:13.8909800Z #define __amd64__ 1 2025-05-07T20:27:13.8909893Z #define __WINT_WIDTH__ 32 2025-05-07T20:27:13.8909999Z #define __CUDA_DEVICE_RUNTIME_API_H__ 2025-05-07T20:27:13.8910259Z #define __REDIRECT_NTHNL(name,proto,alias) name proto __THROWNL __asm__ (__ASMNAME (#alias)) 2025-05-07T20:27:13.8910361Z #define _GLIBCXX_STDIO_SEEK_CUR 1 2025-05-07T20:27:13.8910443Z #define EOF (-1) 2025-05-07T20:27:13.8910535Z #define __WAIT_STATUS_DEFN void * 2025-05-07T20:27:13.8910717Z #define __USE_POSIX199309 1 2025-05-07T20:27:13.8910808Z #define __INT_LEAST64_WIDTH__ 64 2025-05-07T20:27:13.8910905Z #define __LDBL_MAX_EXP__ 16384 2025-05-07T20:27:13.8910995Z #define __FLT32X_MAX_10_EXP__ 308 2025-05-07T20:27:13.8911088Z #define LLONG_MIN (-LLONG_MAX-1) 2025-05-07T20:27:13.8911203Z #define cudaSurfaceType2DLayered 0xF2 2025-05-07T20:27:13.8911293Z #define ____mbstate_t_defined 1 2025-05-07T20:27:13.8911376Z #define STA_NANO 0x2000 2025-05-07T20:27:13.8911473Z #define _GLIBCXX_HAVE_LOG10F 1 2025-05-07T20:27:13.8911564Z #define _GLIBCXX_HAVE_LOG10L 1 2025-05-07T20:27:13.8911646Z #define _IO_LINKED 0x80 2025-05-07T20:27:13.8911851Z #define __cpp_lib_launder 201606 2025-05-07T20:27:13.8911942Z #define __SIZEOF_INT128__ 16 2025-05-07T20:27:13.8912040Z 
#define __PTHREAD_MUTEX_HAVE_PREV 1 2025-05-07T20:27:13.8912135Z #define __FLT64X_IS_IEC_60559__ 2 2025-05-07T20:27:13.8912224Z #define _GLIBCXX_TYPE_TRAITS 1 2025-05-07T20:27:13.8912366Z #define cudaGraphKernelNodePortProgrammatic 1 2025-05-07T20:27:13.8912475Z #define __DEVICE_ATOMIC_FUNCTIONS_HPP__ 2025-05-07T20:27:13.8912570Z #define __BLKCNT64_T_TYPE __SQUAD_TYPE 2025-05-07T20:27:13.8912668Z #define __LDBL_MAX_10_EXP__ 4932 2025-05-07T20:27:13.8912757Z #define __W_CONTINUED 0xffff 2025-05-07T20:27:13.8912843Z #define __ATOMIC_RELAXED 0 2025-05-07T20:27:13.8912974Z #define w_coredump __wait_terminated.__w_coredump 2025-05-07T20:27:13.8913092Z #define __FSBLKCNT_T_TYPE __SYSCALL_ULONG_TYPE 2025-05-07T20:27:13.8913289Z #define __cudaCDP2OccupancyMaxActiveBlocksPerMultiprocessor 2025-05-07T20:27:13.8913475Z #define __DBL_EPSILON__ double(2.22044604925031308084726333618164062e-16L) 2025-05-07T20:27:13.8913562Z #define __stub_stty 2025-05-07T20:27:13.8913728Z #define _tolower(c) ((int) (*__ctype_tolower_loc ())[(int) (c)]) 2025-05-07T20:27:13.8913812Z #define le16toh(x) (x) 2025-05-07T20:27:13.8913914Z #define BC_SCALE_MAX _POSIX2_BC_SCALE_MAX 2025-05-07T20:27:13.8914088Z #define __FLT128_MIN__ 3.36210314311209350626267781732175260e-4932F128 2025-05-07T20:27:13.8914172Z #define _SIZET_ 2025-05-07T20:27:13.8914258Z #define XATTR_NAME_MAX 255 2025-05-07T20:27:13.8914345Z #define _SVID_SOURCE 1 2025-05-07T20:27:13.8914423Z #define _LP64 1 2025-05-07T20:27:13.8914510Z #define _LIBC_LIMITS_H_ 1 2025-05-07T20:27:13.8914747Z #define __REDIRECT_NTH_LDBL(name,proto,alias) __REDIRECT_NTH (name, proto, alias) 2025-05-07T20:27:13.8914854Z #define _GLIBCXX_TR1_BESSEL_FUNCTION_TCC 1 2025-05-07T20:27:13.8914942Z #define __UINT8_C(c) c 2025-05-07T20:27:13.8915032Z #define _GLIBCXX_HAVE_CEILF 1 2025-05-07T20:27:13.8915121Z #define _GLIBCXX_HAVE_CEILL 1 2025-05-07T20:27:13.8915236Z #define __cudaCDP2Memset3DAsync_ptsz 2025-05-07T20:27:13.8915327Z #define __CUDA_ARCH_LIST__ 520 2025-05-07T20:27:13.8915417Z #define __FLT64_MAX_EXP__ 1024 2025-05-07T20:27:13.8915514Z #define MOD_MAXERROR ADJ_MAXERROR 2025-05-07T20:27:13.8915599Z #define CUDARTAPI 2025-05-07T20:27:13.8915679Z #define IOV_MAX 1024 2025-05-07T20:27:13.8915823Z #define __glibcxx_requires_irreflexive2(_First,_Last) 2025-05-07T20:27:13.8915919Z #define __INT_LEAST32_TYPE__ int 2025-05-07T20:27:13.8916015Z #define P_tmpdir "/tmp" 2025-05-07T20:27:13.8916118Z #define cudaMemAttachSingle 0x04 2025-05-07T20:27:13.8916199Z #define __wchar_t__ 2025-05-07T20:27:13.8916305Z #define __cpp_lib_is_aggregate 201703 2025-05-07T20:27:13.8916382Z #define SEEK_END 2 2025-05-07T20:27:13.8916469Z #define __SIZEOF_WCHAR_T__ 4 2025-05-07T20:27:13.8916643Z #define _GLIBCXX_USE_TBB_PAR_BACKEND __has_include() 2025-05-07T20:27:13.8916737Z #define _IO_ftrylockfile(_fp) 2025-05-07T20:27:13.8916876Z #define _GLIBCXX_USE_C99_WCHAR _GLIBCXX11_USE_C99_WCHAR 2025-05-07T20:27:13.8916971Z #define ____FILE_defined 1 2025-05-07T20:27:13.8917083Z #define _GLIBCXX_HAVE_BUILTIN_IS_AGGREGATE 1 2025-05-07T20:27:13.8917176Z #define __GNUC_PATCHLEVEL__ 0 2025-05-07T20:27:13.8917265Z #define _ISOC99_SOURCE 1 2025-05-07T20:27:13.8917357Z #define __VECTOR_FUNCTIONS_H__ 2025-05-07T20:27:13.8917596Z #define __REDIRECT_NTH(name,proto,alias) name proto __THROW __asm__ (__ASMNAME (#alias)) 2025-05-07T20:27:13.8917812Z #define _PSTL_USE_NONTEMPORAL_STORES_IF_ALLOWED 2025-05-07T20:27:13.8917891Z #define _IO_RIGHT 04 2025-05-07T20:27:13.8917987Z #define __END_NAMESPACE_STD 2025-05-07T20:27:13.8918172Z 
#define __FLT128_NORM_MAX__ 1.18973149535723176508575932662800702e+4932F128 2025-05-07T20:27:13.8918259Z #define _GLIBCXX_STD_C std 2025-05-07T20:27:13.8918379Z #define cudaInitDeviceFlagsAreValid 0x01 2025-05-07T20:27:13.8918468Z #define _LARGEFILE64_SOURCE 1 2025-05-07T20:27:13.8918564Z #define _GLIBCXX_USE_C99_STDINT_TR1 1 2025-05-07T20:27:13.8918653Z #define _STDDEF_H_ 2025-05-07T20:27:13.8918962Z #define __FLT64_NORM_MAX__ 1.79769313486231570814527423731704357e+308F64 2025-05-07T20:27:13.8919103Z #define __FLT128_HAS_QUIET_NAN__ 1 2025-05-07T20:27:13.8919230Z #define isalnum_l(c,l) __isalnum_l ((c), (l)) 2025-05-07T20:27:13.8919422Z #define __FD_ISSET(d,set) ((__FDS_BITS (set)[__FD_ELT (d)] & __FD_MASK (d)) != 0) 2025-05-07T20:27:13.8919539Z #define __INTMAX_MAX__ 0x7fffffffffffffffL 2025-05-07T20:27:13.8919682Z #define __glibcxx_requires_irreflexive(_First,_Last) 2025-05-07T20:27:13.8919799Z #define cudaGraphKernelNodePortDefault 0 2025-05-07T20:27:13.8919902Z #define __INT_FAST8_TYPE__ signed char 2025-05-07T20:27:13.8920009Z #define __cudaCDP2Memcpy3DAsync_ptsz 2025-05-07T20:27:13.8920101Z #define __PID_T_TYPE __S32_TYPE 2025-05-07T20:27:13.8920217Z #define __cpp_namespace_attributes 201411L 2025-05-07T20:27:13.8920310Z #define CHARCLASS_NAME_MAX 2048 2025-05-07T20:27:13.8920404Z #define _GLIBCXX_HAVE_TANF 1 2025-05-07T20:27:13.8920501Z #define _GLIBCXX_USE_ST_MTIM 1 2025-05-07T20:27:13.8920678Z #define __FLT64X_MIN__ 3.36210314311209350626267781732175260e-4932F64x 2025-05-07T20:27:13.8920775Z #define __CUDA_RUNTIME_H__ 2025-05-07T20:27:13.8920955Z #define WIFSIGNALED(status) __WIFSIGNALED (__WAIT_INT (status)) 2025-05-07T20:27:13.8921051Z #define _GLIBCXX_HAVE_STDLIB_H 1 2025-05-07T20:27:13.8921150Z #define __STDCPP_THREADS__ 1 2025-05-07T20:27:13.8921293Z #define M_2_SQRTPIl 1.128379167095512573896158903121545172L 2025-05-07T20:27:13.8921384Z #define __GNUC_STDC_INLINE__ 1 2025-05-07T20:27:13.8921481Z #define _POSIX_UIO_MAXIOV 16 2025-05-07T20:27:13.8921578Z #define _PSTL_PAR_BACKEND_SERIAL 2025-05-07T20:27:13.8921692Z #define __ASSERT_FUNCTION __PRETTY_FUNCTION__ 2025-05-07T20:27:13.8921786Z #define __FLT64_HAS_DENORM__ 1 2025-05-07T20:27:13.8921881Z #define __WORDSIZE_TIME64_COMPAT32 1 2025-05-07T20:27:13.8922049Z #define _GLIBCXX_DEPRECATED __attribute__ ((__deprecated__)) 2025-05-07T20:27:13.8922212Z #define __FLT32_EPSILON__ 1.19209289550781250000000000000000000e-7F32 2025-05-07T20:27:13.8922312Z #define _PSTL_HIDE_FROM_ABI_PUSH 2025-05-07T20:27:13.8922433Z #define cudaStreamLegacy ((cudaStream_t)0x1) 2025-05-07T20:27:13.8922540Z #define _IO_cleanup_region_start(_fct,_fp) 2025-05-07T20:27:13.8922637Z #define __location__(a) __annotate__(a) 2025-05-07T20:27:13.8922868Z #define __device_builtin_surface_type__ __location__(device_builtin_surface_type) 2025-05-07T20:27:13.8922967Z #define _POSIX2_BC_BASE_MAX 99 2025-05-07T20:27:13.8923080Z #define __cudaCDP2DeviceGetAttribute 2025-05-07T20:27:13.8923197Z #define __DBL_DECIMAL_DIG__ 17 2025-05-07T20:27:13.8923286Z #define __STDC_UTF_32__ 1 2025-05-07T20:27:13.8923377Z #define __INT_FAST8_WIDTH__ 8 2025-05-07T20:27:13.8923475Z #define NAN (__builtin_nanf ("")) 2025-05-07T20:27:13.8923568Z #define _POSIX_MQ_PRIO_MAX 32 2025-05-07T20:27:13.8923646Z #define __FXSR__ 1 2025-05-07T20:27:13.8923732Z #define _SIZE_T 2025-05-07T20:27:13.8923829Z #define _GLIBCXX_USE_GETTIMEOFDAY 1 2025-05-07T20:27:13.8923943Z #define cudaHostRegisterReadOnly 0x08 2025-05-07T20:27:13.8924113Z #define __FLT32X_MAX__ 
1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:27:13.8924258Z #define __WIFSTOPPED(status) (((status) & 0xff) == 0x7f) 2025-05-07T20:27:13.8924349Z #define _IO_ssize_t __ssize_t 2025-05-07T20:27:13.8924455Z #define __ULONG32_TYPE unsigned int 2025-05-07T20:27:13.8924636Z #define __DBL_NORM_MAX__ double(1.79769313486231570814527423731704357e+308L) 2025-05-07T20:27:13.8924934Z #define cudaStreamGraphTailLaunch (cudaStream_t)0x0100000000000000 2025-05-07T20:27:13.8925021Z #define _GXX_NULLPTR_T 2025-05-07T20:27:13.8925142Z #define __glibcxx_class_requires3(_a,_b,_c,_d) 2025-05-07T20:27:13.8925230Z #define FOPEN_MAX 16 2025-05-07T20:27:13.8925315Z #define __BIG_ENDIAN 4321 2025-05-07T20:27:13.8925429Z #define __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ 2025-05-07T20:27:13.8925529Z #define __suseconds_t_defined 2025-05-07T20:27:13.8925613Z #define __off_t_defined 2025-05-07T20:27:13.8925697Z #define stderr stderr 2025-05-07T20:27:13.8925870Z #define M_LOG10E 0.43429448190325182765 2025-05-07T20:27:13.8925980Z #define __glibcxx_requires_string(_String) 2025-05-07T20:27:13.8926081Z #define _GLIBCXX_HAVE_LDEXPL 1 2025-05-07T20:27:13.8926169Z #define __INTMAX_WIDTH__ 64 2025-05-07T20:27:13.8926573Z #define _PSTL_CPP14_2RANGE_MISMATCH_EQUAL_PRESENT (_MSC_VER >= 1900 || __cplusplus >= 201300L || __cpp_lib_robust_nonmodifying_seq_ops == 201304) 2025-05-07T20:27:13.8926674Z #define __mode_t_defined 2025-05-07T20:27:13.8926758Z #define _GCC_SIZE_T 2025-05-07T20:27:13.8926854Z #define __INO64_T_TYPE __UQUAD_TYPE 2025-05-07T20:27:13.8926958Z #define __cpp_runtime_arrays 198712L 2025-05-07T20:27:13.8927060Z #define __UINT64_TYPE__ long unsigned int 2025-05-07T20:27:13.8927151Z #define __USE_XOPEN2K8XSI 1 2025-05-07T20:27:13.8927246Z #define __UINT32_C(c) c ## U 2025-05-07T20:27:13.8927346Z #define __cpp_alias_templates 200704L 2025-05-07T20:27:13.8927446Z #define cudaHostAllocMapped 0x02 2025-05-07T20:27:13.8927552Z #define __DEVICE_LAUNCH_PARAMETERS_H__ 2025-05-07T20:27:13.8927642Z #define _STL_ITERATOR_H 1 2025-05-07T20:27:13.8927727Z #define __size_t__ 2025-05-07T20:27:13.8927853Z #define cudaStreamAttrID cudaLaunchAttributeID 2025-05-07T20:27:13.8927944Z #define _GLIBCXX_HAVE_ATANF 1 2025-05-07T20:27:13.8928052Z #define cudaEventRecordExternal 0x01 2025-05-07T20:27:13.8928199Z #define __isspace_l(c,l) __isctype_l((c), _ISspace, (l)) 2025-05-07T20:27:13.8928297Z #define _IO_BUFSIZ _G_BUFSIZ 2025-05-07T20:27:13.8928490Z #define __FLT_DENORM_MIN__ 1.40129846432481707092372958328991613e-45F 2025-05-07T20:27:13.8928593Z #define _ENDIAN_H 1 2025-05-07T20:27:13.8928697Z #define __builtin_align__(a) __align__(a) 2025-05-07T20:27:13.8928794Z #define _GLIBCXX20_CONSTEXPR 2025-05-07T20:27:13.8928889Z #define __NV_NO_HOST_COMPILER_CHECK 1 2025-05-07T20:27:13.8928974Z #define __try try 2025-05-07T20:27:13.8929070Z #define _GLIBCXX_HAVE_FINITE 1 2025-05-07T20:27:13.8929158Z #define __FLT128_IS_IEC_60559__ 2 2025-05-07T20:27:13.8929251Z #define __INT8_MAX__ 0x7f 2025-05-07T20:27:13.8929505Z #define cudaStreamGetCaptureInfo __CUDART_API_PTSZ(cudaStreamGetCaptureInfo_v2) 2025-05-07T20:27:13.8929591Z #define __LONG_WIDTH__ 64 2025-05-07T20:27:13.8929675Z #define __PIC__ 2 2025-05-07T20:27:13.8929782Z #define BC_STRING_MAX _POSIX2_BC_STRING_MAX 2025-05-07T20:27:13.8929897Z #define __UINT_FAST32_TYPE__ long unsigned int 2025-05-07T20:27:13.8930035Z #define FD_ISSET(fd,fdsetp) __FD_ISSET (fd, fdsetp) 2025-05-07T20:27:13.8930126Z #define _GLIBCXX_HAVE_FLOAT_H 1 2025-05-07T20:27:13.8930216Z #define 
_GLIBCXX_HAVE_ATANL 1 2025-05-07T20:27:13.8930405Z #define __FLT32X_NORM_MAX__ 1.79769313486231570814527423731704357e+308F32x 2025-05-07T20:27:13.8930502Z #define __DEVICE_FUNCTIONS_HPP__ 2025-05-07T20:27:13.8930601Z #define __CHAR32_TYPE__ unsigned int 2025-05-07T20:27:13.8930688Z #define _IO_uid_t __uid_t 2025-05-07T20:27:13.8930781Z #define _GLIBCXX_HAVE_READLINK 1 2025-05-07T20:27:13.8930909Z #define __cudaCDP2EventRecordWithFlags_ptsz 2025-05-07T20:27:13.8931002Z #define _CONCEPT_CHECK_H 1 2025-05-07T20:27:13.8931234Z #define __FLT_MAX__ 3.40282346638528859811704183484516925e+38F 2025-05-07T20:27:13.8931333Z #define _GLIBCXX_HAVE_NETINET_IN_H 1 2025-05-07T20:27:13.8931455Z #define _GLIBCXX_TR1_SPECIAL_FUNCTION_UTIL_H 1 2025-05-07T20:27:13.8931537Z #define LONG_BIT 64 2025-05-07T20:27:13.8931641Z #define __SIZEOF_PTHREAD_BARRIERATTR_T 4 2025-05-07T20:27:13.8931831Z #define _GLIBCXX_USE_ALLOCATOR_NEW 1 2025-05-07T20:27:13.8931958Z #define __cpp_lib_math_special_functions 201603L 2025-05-07T20:27:13.8932052Z #define __fsfilcnt_t_defined 2025-05-07T20:27:13.8932145Z #define __blkcnt_t_defined 2025-05-07T20:27:13.8932414Z #define cudaKernelNodeAttributeMemSyncDomain cudaLaunchAttributeMemSyncDomain 2025-05-07T20:27:13.8932507Z #define __USE_LARGEFILE 1 2025-05-07T20:27:13.8932602Z #define __cpp_constexpr 201603L 2025-05-07T20:27:13.8932695Z #define CUDART_VERSION 12080 2025-05-07T20:27:13.8932789Z #define NL_TEXTMAX INT_MAX 2025-05-07T20:27:13.8932889Z #define cudaDeviceMapHost 0x08 2025-05-07T20:27:13.8933052Z #define _GLIBCXX_CMATH 1 2025-05-07T20:27:13.8933252Z #define __attribute_format_arg__(x) __attribute__ ((__format_arg__ (x))) 2025-05-07T20:27:13.8933342Z #define __lldiv_t_defined 1 2025-05-07T20:27:13.8933422Z #define __SSE2__ 1 2025-05-07T20:27:13.8933505Z #define _IOLBF 1 2025-05-07T20:27:13.8933603Z #define _GLIBCXX_HAVE_SYS_TYPES_H 1 2025-05-07T20:27:13.8933702Z #define _GLIBCXX_HAVE_FLOORF 1 2025-05-07T20:27:13.8933808Z #define __cpp_deduction_guides 201703L 2025-05-07T20:27:13.8933899Z #define _GLIBCXX_HAVE_EXPF 1 2025-05-07T20:27:13.8934013Z #define __annotate__(a) __attribute__((a)) 2025-05-07T20:27:13.8934100Z #define __INT32_TYPE__ int 2025-05-07T20:27:13.8934188Z #define __SIZEOF_DOUBLE__ 8 2025-05-07T20:27:13.8934298Z #define cudaDeviceSyncMemops 0x80 2025-05-07T20:27:13.8934396Z #define __cpp_exceptions 199711L 2025-05-07T20:27:13.8934487Z #define __FLT_MIN_10_EXP__ (-37) 2025-05-07T20:27:13.8934602Z #define cudaDeviceScheduleYield 0x02 2025-05-07T20:27:13.8934699Z #define _SYS_SYSMACROS_H 1 2025-05-07T20:27:13.8934816Z #define _GLIBCXX_TR1_LEGENDRE_FUNCTION_TCC 1 2025-05-07T20:27:13.8934979Z #define __FLT64_MIN__ 2.22507385850720138309023271733240406e-308F64 2025-05-07T20:27:13.8935075Z #define __INT_LEAST32_WIDTH__ 32 2025-05-07T20:27:13.8935180Z #define __SWORD_TYPE long int 2025-05-07T20:27:13.8935277Z #define __INTMAX_TYPE__ long int 2025-05-07T20:27:13.8935375Z #define _GLIBCXX11_USE_C99_MATH 1 2025-05-07T20:27:13.8935474Z #define __PTHREAD_SPINS 0, 0 2025-05-07T20:27:13.8935563Z #define _BITS_POSIX1_LIM_H 1 2025-05-07T20:27:13.8935844Z #define cudaStreamAttributeMemSyncDomainMap cudaLaunchAttributeMemSyncDomainMap 2025-05-07T20:27:13.8935943Z #define __DEC128_MAX_EXP__ 6145 2025-05-07T20:27:13.8936088Z #define math_errhandling (MATH_ERRNO | MATH_ERREXCEPT) 2025-05-07T20:27:13.8936169Z #define _T_SIZE 2025-05-07T20:27:13.8936281Z #define cudaHostAllocDefault 0x00 2025-05-07T20:27:13.8936403Z #define _PSTL_PRAGMA_SIMD_EXCLUSIVE_SCAN(PRM) 
2025-05-07T20:27:13.8936530Z [... long run of compiler predefined-macro output elided: the step dumps every macro defined by the nvcc/g++ toolchain and the CUDA, glibc, and libstdc++ headers (e.g. __CUDACC__ 1, __NVCC__ 1, __GNUC_MINOR__ 4, _GLIBCXX_USE_INT128 1) as part of verifying the freshly installed CUDA 12.8 toolchain ...]
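A dump like the one elided above can be reproduced outside CI with the host compiler's preprocessor; a minimal sketch (an assumed invocation, not necessarily the exact command used by setup_env.bash):

    # Print the macros the C++ toolchain predefines. Preprocessing a .cu
    # file through nvcc instead would additionally pull in the CUDA macros
    # (__CUDACC__, __NVCC__, ...) seen in the dump above.
    echo | g++ -dM -E -x c++ - | sort | head -n 20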
2025-05-07T20:27:13.9163909Z + conda run -n build_binary nvcc --version
2025-05-07T20:27:15.8089610Z nvcc: NVIDIA (R) Cuda compiler driver
2025-05-07T20:27:15.8089968Z Copyright (c) 2005-2025 NVIDIA Corporation
2025-05-07T20:27:15.8090268Z Built on Wed_Jan_15_19:20:09_PST_2025
2025-05-07T20:27:15.8090563Z Cuda compilation tools, release 12.8, V12.8.61
2025-05-07T20:27:15.8090893Z Build cuda_12.8.r12.8/compiler.35404655_0
2025-05-07T20:27:15.8716686Z /usr/bin/nvidia-smi
2025-05-07T20:27:15.8722492Z + nvidia-smi
2025-05-07T20:27:15.8892435Z Wed May  7 20:27:15 2025
2025-05-07T20:27:15.8892898Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:27:15.8893393Z | NVIDIA-SMI 570.133.07             Driver Version: 570.133.07     CUDA Version: 12.8     |
2025-05-07T20:27:15.8893899Z |-----------------------------------------+------------------------+----------------------+
2025-05-07T20:27:15.8894372Z | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
2025-05-07T20:27:15.8894886Z | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
2025-05-07T20:27:15.8895300Z |                                         |                        |               MIG M. |
2025-05-07T20:27:15.8895626Z |=========================================+========================+======================|
2025-05-07T20:27:15.9072505Z |   0  NVIDIA A10G                    On  |   00000000:00:1E.0 Off |                    0 |
2025-05-07T20:27:15.9073954Z |  0%   25C    P8             16W /  300W |       0MiB /  23028MiB |      0%      Default |
2025-05-07T20:27:15.9075014Z |                                         |                        |                  N/A |
2025-05-07T20:27:15.9076078Z +-----------------------------------------+------------------------+----------------------+
2025-05-07T20:27:15.9077572Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:27:15.9080982Z | Processes:                                                                              |
2025-05-07T20:27:15.9081581Z |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
2025-05-07T20:27:15.9082149Z |        ID   ID                                                               Usage      |
2025-05-07T20:27:15.9082615Z |=========================================================================================|
2025-05-07T20:27:15.9083184Z |  No running processes found                                                             |
2025-05-07T20:27:15.9083806Z +-----------------------------------------------------------------------------------------+
2025-05-07T20:27:16.1640532Z [INSTALL] Successfully installed CUDA 12.8.0
2025-05-07T20:27:16.1695522Z ##[group]Run . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.8.0
2025-05-07T20:27:16.1696059Z . $PRELUDE; install_pytorch_pip $BUILD_ENV nightly cuda/12.8.0
2025-05-07T20:27:16.1708661Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-05-07T20:27:16.1709009Z env:
2025-05-07T20:27:16.1709255Z   PRELUDE: .github/scripts/setup_env.bash
2025-05-07T20:27:16.1709591Z   BUILD_ENV: build_binary
2025-05-07T20:27:16.1709842Z   BUILD_TARGET: genai
2025-05-07T20:27:16.1710075Z   BUILD_VARIANT: cuda
2025-05-07T20:27:16.1710306Z   BUILD_CUDA_VERSION: 12.8.0
2025-05-07T20:27:16.1710563Z   ENFORCE_CUDA_DEVICE: 1
2025-05-07T20:27:16.1710867Z   GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all
2025-05-07T20:27:16.1711197Z ##[endgroup]
2025-05-07T20:27:16.5076311Z ################################################################################
2025-05-07T20:27:16.5076661Z # Install PyTorch (PIP)
2025-05-07T20:27:16.5076917Z #
2025-05-07T20:27:16.5092889Z # [2025-05-07T20:27:16.508Z] + install_pytorch_pip build_binary nightly cuda/12.8.0
2025-05-07T20:27:16.5093312Z ################################################################################
2025-05-07T20:27:16.5122076Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y numpy
2025-05-07T20:27:17.4967315Z Channels:
2025-05-07T20:27:17.4967551Z  - conda-forge
2025-05-07T20:27:17.4967765Z Platform: linux-64
2025-05-07T20:27:20.7474283Z Collecting package metadata (repodata.json): done
2025-05-07T20:27:21.4572314Z Solving environment: done
2025-05-07T20:27:21.6755633Z ## Package Plan ##
2025-05-07T20:27:21.6756116Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:27:21.6756643Z   added / updated specs:
2025-05-07T20:27:21.6756989Z     - numpy
2025-05-07T20:27:21.6757303Z The following packages will be downloaded:
2025-05-07T20:27:21.6757713Z     package                    |            build
2025-05-07T20:27:21.6758030Z     ---------------------------|-----------------
2025-05-07T20:27:21.6758405Z     libblas-3.9.0              |31_h59b9bed_openblas          16 KB  conda-forge
2025-05-07T20:27:21.6758927Z     libcblas-3.9.0             |31_he106b2a_openblas          16 KB  conda-forge
2025-05-07T20:27:21.6759536Z     libgfortran-15.1.0         |       h69a702a_2             34 KB  conda-forge
2025-05-07T20:27:21.6760131Z     libgfortran5-15.1.0        |       hcea5267_2            1.5 MB  conda-forge
2025-05-07T20:27:21.6760657Z     liblapack-3.9.0            |31_h7ac8fdf_openblas          16 KB  conda-forge
2025-05-07T20:27:21.6761116Z     libopenblas-0.3.29         |pthreads_h94d23a6_0           5.6 MB  conda-forge
2025-05-07T20:27:21.6761569Z     numpy-2.2.5                |   py313h17eae1a_0            8.1 MB  conda-forge
2025-05-07T20:27:21.6761950Z     ------------------------------------------------------------
2025-05-07T20:27:21.6762274Z                                            Total:        15.4 MB
2025-05-07T20:27:21.6762602Z The following NEW packages will be INSTALLED:
2025-05-07T20:27:21.6763023Z   libblas       conda-forge/linux-64::libblas-3.9.0-31_h59b9bed_openblas
2025-05-07T20:27:21.6763517Z   libcblas      conda-forge/linux-64::libcblas-3.9.0-31_he106b2a_openblas
2025-05-07T20:27:21.6764009Z   libgfortran   conda-forge/linux-64::libgfortran-15.1.0-h69a702a_2
2025-05-07T20:27:21.6764488Z   libgfortran5  conda-forge/linux-64::libgfortran5-15.1.0-hcea5267_2
2025-05-07T20:27:21.6764988Z   liblapack     conda-forge/linux-64::liblapack-3.9.0-31_h7ac8fdf_openblas
2025-05-07T20:27:21.6765514Z   libopenblas   conda-forge/linux-64::libopenblas-0.3.29-pthreads_h94d23a6_0
2025-05-07T20:27:21.6766547Z   numpy         conda-forge/linux-64::numpy-2.2.5-py313h17eae1a_0
2025-05-07T20:27:21.6766983Z Downloading and Extracting Packages: ...working...
2025-05-07T20:27:21.6767486Z [... interactive download progress bars and terminal control sequences elided; all seven packages reached 100% ...]
2025-05-07T20:27:22.6256791Z done
2025-05-07T20:27:22.7261903Z Preparing transaction: done
2025-05-07T20:27:22.9268411Z Verifying transaction: done
2025-05-07T20:27:23.0277075Z Executing transaction: done
2025-05-07T20:27:23.2060844Z ################################################################################
2025-05-07T20:27:23.2061420Z # Install Package From PyTorch PIP: torch
2025-05-07T20:27:23.2061861Z #
2025-05-07T20:27:23.2076390Z # [2025-05-07T20:27:23.207Z] + install_from_pytorch_pip build_binary torch nightly cuda/12.8.0
2025-05-07T20:27:23.2077149Z ################################################################################
2025-05-07T20:27:23.2092066Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null
2025-05-07T20:27:23.2998465Z [CHECK] Network does not appear to be blocked.
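The "Prepare PIP Arguments" step that follows maps the requested channel and CUDA version onto a PyTorch PIP index URL (nightly + cuda/12.8.0 -> cu128). A minimal sketch of that mapping, with hypothetical variable names (the real logic lives in __prepare_pip_arguments in setup_env.bash):

    # Derive the cu128 variant tag and nightly index URL from "cuda/12.8.0".
    variant_spec="cuda/12.8.0"
    ver="${variant_spec#cuda/}"                        # 12.8.0
    tag="cu$(echo "${ver}" | cut -d. -f1-2 | tr -d .)" # cu128
    echo "https://download.pytorch.org/whl/nightly/${tag}/"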
2025-05-07T20:27:23.2998942Z ################################################################################ 2025-05-07T20:27:23.2999386Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:27:23.2999671Z # 2025-05-07T20:27:23.3016234Z # [2025-05-07T20:27:23.301Z] + __prepare_pip_arguments torch nightly cuda/12.8.0 2025-05-07T20:27:23.3016807Z ################################################################################ 2025-05-07T20:27:23.3017562Z 2025-05-07T20:27:23.3038628Z [INSTALL] Extracted package (channel, version): (nightly, LATEST) 2025-05-07T20:27:23.3065497Z [INSTALL] Extracted package variant: cu128 2025-05-07T20:27:23.3082822Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:27:23.3083553Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/cu128/ 2025-05-07T20:27:23.3092400Z [INSTALL] Extracted the full PIP package: --pre torch 2025-05-07T20:27:23.3100302Z [INSTALL] Attempting to install [torch, LATEST] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/cu128/ ... 2025-05-07T20:27:23.3121230Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128/ 2025-05-07T20:29:00.7023278Z DEPRECATION: Building 'MarkupSafe' using the legacy setup.py bdist_wheel mechanism, which will be removed in a future version. pip 25.3 will enforce this behaviour change. A possible replacement is to use the standardized build interface by setting the `--use-pep517` option, (possibly combined with `--no-build-isolation`), or adding a `pyproject.toml` file to the source tree of 'MarkupSafe'. Discussion can be found at https://github.com/pypa/pip/issues/6334 2025-05-07T20:29:00.7025999Z Looking in indexes: https://download.pytorch.org/whl/nightly/cu128/ 2025-05-07T20:29:00.7026461Z 2025-05-07T20:29:00.7026584Z Collecting torch 2025-05-07T20:29:00.7027383Z Downloading https://download.pytorch.org/whl/nightly/cu128/torch-2.8.0.dev20250507%2Bcu128-cp313-cp313-manylinux_2_28_x86_64.whl.metadata (30 kB) 2025-05-07T20:29:00.7028534Z Collecting filelock (from torch) 2025-05-07T20:29:00.7029230Z Downloading https://download.pytorch.org/whl/nightly/filelock-3.16.1-py3-none-any.whl (16 kB) 2025-05-07T20:29:00.7030641Z Requirement already satisfied: typing-extensions>=4.10.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from torch) (4.13.2) 2025-05-07T20:29:00.7031933Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from torch) (78.1.1) 2025-05-07T20:29:00.7032591Z Collecting sympy>=1.13.3 (from torch) 2025-05-07T20:29:00.7033084Z Downloading https://download.pytorch.org/whl/nightly/sympy-1.13.3-py3-none-any.whl (6.2 MB) 2025-05-07T20:29:00.7033918Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.2/6.2 MB 41.8 MB/s eta 0:00:00 2025-05-07T20:29:00.7034266Z Collecting networkx (from torch) 2025-05-07T20:29:00.7034761Z Downloading https://download.pytorch.org/whl/nightly/networkx-3.4.2-py3-none-any.whl (1.7 MB) 2025-05-07T20:29:00.7035392Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 18.8 MB/s eta 0:00:00 2025-05-07T20:29:00.7035731Z Collecting jinja2 (from torch) 2025-05-07T20:29:00.7036201Z Downloading https://download.pytorch.org/whl/nightly/jinja2-3.1.4-py3-none-any.whl (133 kB) 2025-05-07T20:29:00.7036693Z Collecting fsspec (from torch) 2025-05-07T20:29:00.7037194Z Downloading https://download.pytorch.org/whl/nightly/fsspec-2024.10.0-py3-none-any.whl (179 kB) 
2025-05-07T20:29:00.7037760Z Collecting nvidia-cuda-nvrtc-cu12==12.8.61 (from torch) 2025-05-07T20:29:00.7038616Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_nvrtc_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:29:00.7039432Z Collecting nvidia-cuda-runtime-cu12==12.8.57 (from torch) 2025-05-07T20:29:00.7040658Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_runtime_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:29:00.7041521Z Collecting nvidia-cuda-cupti-cu12==12.8.57 (from torch) 2025-05-07T20:29:00.7042312Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_cupti_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:29:00.7043092Z Collecting nvidia-cudnn-cu12==9.8.0.87 (from torch) 2025-05-07T20:29:00.7044407Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cudnn_cu12-9.8.0.87-py3-none-manylinux_2_27_x86_64.whl.metadata (1.8 kB) 2025-05-07T20:29:00.7045124Z Collecting nvidia-cublas-cu12==12.8.3.14 (from torch) 2025-05-07T20:29:00.7045862Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cublas_cu12-12.8.3.14-py3-none-manylinux_2_27_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:29:00.7046551Z Collecting nvidia-cufft-cu12==11.3.3.41 (from torch) 2025-05-07T20:29:00.7047325Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufft_cu12-11.3.3.41-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB) 2025-05-07T20:29:00.7048135Z Collecting nvidia-curand-cu12==10.3.9.55 (from torch) 2025-05-07T20:29:00.7048835Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_curand_cu12-10.3.9.55-py3-none-manylinux_2_27_x86_64.whl.metadata (1.5 kB) 2025-05-07T20:29:00.7049544Z Collecting nvidia-cusolver-cu12==11.7.2.55 (from torch) 2025-05-07T20:29:00.7050261Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusolver_cu12-11.7.2.55-py3-none-manylinux_2_27_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:29:00.7051096Z Collecting nvidia-cusparse-cu12==12.5.7.53 (from torch) 2025-05-07T20:29:00.7051930Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparse_cu12-12.5.7.53-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:29:00.7052724Z Collecting nvidia-cusparselt-cu12==0.6.3 (from torch) 2025-05-07T20:29:00.7053435Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl.metadata (6.8 kB) 2025-05-07T20:29:00.7054145Z Collecting nvidia-nccl-cu12==2.26.2 (from torch) 2025-05-07T20:29:00.7054896Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.0 kB) 2025-05-07T20:29:00.7055670Z Collecting nvidia-nvtx-cu12==12.8.55 (from torch) 2025-05-07T20:29:00.7056420Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvtx_cu12-12.8.55-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:29:00.7057193Z Collecting nvidia-nvjitlink-cu12==12.8.61 (from torch) 2025-05-07T20:29:00.7058013Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvjitlink_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl.metadata (1.7 kB) 2025-05-07T20:29:00.7058806Z Collecting nvidia-cufile-cu12==1.13.0.11 (from torch) 
2025-05-07T20:29:00.7059623Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufile_cu12-1.13.0.11-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB) 2025-05-07T20:29:00.7060418Z Collecting pytorch-triton==3.3.0+git96316ce5 (from torch) 2025-05-07T20:29:00.7061244Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.6 kB) 2025-05-07T20:29:00.7062050Z Collecting mpmath<1.4,>=1.1.0 (from sympy>=1.13.3->torch) 2025-05-07T20:29:00.7062604Z Downloading https://download.pytorch.org/whl/nightly/mpmath-1.3.0-py3-none-any.whl (536 kB) 2025-05-07T20:29:00.7063315Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 kB 5.6 MB/s eta 0:00:00 2025-05-07T20:29:00.7063683Z Collecting MarkupSafe>=2.0 (from jinja2->torch) 2025-05-07T20:29:00.7064173Z Downloading https://download.pytorch.org/whl/nightly/MarkupSafe-2.1.5.tar.gz (19 kB) 2025-05-07T20:29:00.7064652Z Preparing metadata (setup.py): started 2025-05-07T20:29:00.7065027Z Preparing metadata (setup.py): finished with status 'done' 2025-05-07T20:29:00.7065772Z Downloading https://download.pytorch.org/whl/nightly/cu128/torch-2.8.0.dev20250507%2Bcu128-cp313-cp313-manylinux_2_28_x86_64.whl (1047.0 MB) 2025-05-07T20:29:00.7066560Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.0/1.0 GB 22.5 MB/s eta 0:00:00 2025-05-07T20:29:00.7067516Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cublas_cu12-12.8.3.14-py3-none-manylinux_2_27_x86_64.whl (609.6 MB) 2025-05-07T20:29:00.7068410Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 609.6/609.6 MB 53.4 MB/s eta 0:00:00 2025-05-07T20:29:00.7069182Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_cupti_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (10.2 MB) 2025-05-07T20:29:00.7070026Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.2/10.2 MB 176.3 MB/s eta 0:00:00 2025-05-07T20:29:00.7070789Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_nvrtc_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (88.0 MB) 2025-05-07T20:29:00.7071633Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 88.0/88.0 MB 169.6 MB/s eta 0:00:00 2025-05-07T20:29:00.7072397Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cuda_runtime_cu12-12.8.57-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (954 kB) 2025-05-07T20:29:00.7073263Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 954.8/954.8 kB 103.2 MB/s eta 0:00:00 2025-05-07T20:29:00.7073936Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cudnn_cu12-9.8.0.87-py3-none-manylinux_2_27_x86_64.whl (698.0 MB) 2025-05-07T20:29:00.7074692Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 698.0/698.0 MB 45.3 MB/s eta 0:00:00 2025-05-07T20:29:00.7075450Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufft_cu12-11.3.3.41-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (193.1 MB) 2025-05-07T20:29:00.7076295Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 193.1/193.1 MB 89.3 MB/s eta 0:00:00 2025-05-07T20:29:00.7077044Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cufile_cu12-1.13.0.11-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.2 MB) 2025-05-07T20:29:00.7078022Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 66.4 MB/s eta 0:00:00 2025-05-07T20:29:00.7078727Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_curand_cu12-10.3.9.55-py3-none-manylinux_2_27_x86_64.whl (63.6 
MB) 2025-05-07T20:29:00.7079481Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63.6/63.6 MB 148.7 MB/s eta 0:00:00 2025-05-07T20:29:00.7080180Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusolver_cu12-11.7.2.55-py3-none-manylinux_2_27_x86_64.whl (260.4 MB) 2025-05-07T20:29:00.7081066Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 260.4/260.4 MB 107.4 MB/s eta 0:00:00 2025-05-07T20:29:00.7081845Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparse_cu12-12.5.7.53-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (292.1 MB) 2025-05-07T20:29:00.7082695Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 292.1/292.1 MB 93.1 MB/s eta 0:00:00 2025-05-07T20:29:00.7083382Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl (156.8 MB) 2025-05-07T20:29:00.7084180Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 156.8/156.8 MB 136.5 MB/s eta 0:00:00 2025-05-07T20:29:00.7085160Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (201.3 MB) 2025-05-07T20:29:00.7086003Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 201.3/201.3 MB 130.4 MB/s eta 0:00:00 2025-05-07T20:29:00.7086764Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvjitlink_cu12-12.8.61-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (39.2 MB) 2025-05-07T20:29:00.7087602Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39.2/39.2 MB 161.2 MB/s eta 0:00:00 2025-05-07T20:29:00.7088338Z Downloading https://download.pytorch.org/whl/nightly/cu128/nvidia_nvtx_cu12-12.8.55-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (89 kB) 2025-05-07T20:29:00.7089474Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.3.0%2Bgit96316ce5-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (153.5 MB) 2025-05-07T20:29:00.7090352Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 153.5/153.5 MB 125.7 MB/s eta 0:00:00 2025-05-07T20:29:00.7090731Z Building wheels for collected packages: MarkupSafe 2025-05-07T20:29:00.7091116Z Building wheel for MarkupSafe (setup.py): started 2025-05-07T20:29:00.7091551Z Building wheel for MarkupSafe (setup.py): finished with status 'done' 2025-05-07T20:29:00.7092396Z Created wheel for MarkupSafe: filename=markupsafe-2.1.5-cp313-cp313-linux_x86_64.whl size=14954 sha256=c651adbbf11229a5595504d32ca1e5d9b02f5c896a75bb208e770b56236dac00 2025-05-07T20:29:00.7093410Z Stored in directory: /home/ec2-user/.cache/pip/wheels/3a/21/87/28c44597225fd0c28d6ffa365f1c2c9dd0ab763711aa4957c6 2025-05-07T20:29:00.7094009Z Successfully built MarkupSafe 2025-05-07T20:29:00.7095700Z Installing collected packages: nvidia-cusparselt-cu12, mpmath, sympy, pytorch-triton, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, filelock, nvidia-cusparse-cu12, nvidia-cufft-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch 2025-05-07T20:29:00.7097278Z 2025-05-07T20:29:00.7099218Z Successfully installed MarkupSafe-2.1.5 filelock-3.16.1 fsspec-2024.10.0 jinja2-3.1.4 mpmath-1.3.0 networkx-3.4.2 nvidia-cublas-cu12-12.8.3.14 nvidia-cuda-cupti-cu12-12.8.57 nvidia-cuda-nvrtc-cu12-12.8.61 nvidia-cuda-runtime-cu12-12.8.57 nvidia-cudnn-cu12-9.8.0.87 nvidia-cufft-cu12-11.3.3.41 nvidia-cufile-cu12-1.13.0.11 nvidia-curand-cu12-10.3.9.55 nvidia-cusolver-cu12-11.7.2.55 
nvidia-cusparse-cu12-12.5.7.53 nvidia-cusparselt-cu12-0.6.3 nvidia-nccl-cu12-2.26.2 nvidia-nvjitlink-cu12-12.8.61 nvidia-nvtx-cu12-12.8.55 pytorch-triton-3.3.0+git96316ce5 sympy-1.13.3 torch-2.8.0.dev20250507+cu128 2025-05-07T20:29:00.7101233Z 2025-05-07T20:29:02.9352652Z torch 2.8.0.dev20250507+cu128 2025-05-07T20:29:02.9354666Z [CHECK] The installed package [torch, nightly/LATEST] is the correct variant (cu128) 2025-05-07T20:29:06.4122360Z [CHECK] Python (sub-)package 'torch.distributed' found ... 2025-05-07T20:29:09.9015744Z [CHECK] NOTE: The installed version is: 2.8.0.dev20250507+cu128 2025-05-07T20:29:09.9016168Z [CHECK] NOTE: Checking _GLIBCXX_USE_CXX11_ABI ... 2025-05-07T20:29:13.2980487Z True 2025-05-07T20:29:13.2980715Z True 2025-05-07T20:29:13.2980822Z 2025-05-07T20:29:13.3599066Z [INSTALL] Successfully installed PyTorch through PyTorch PIP 2025-05-07T20:29:13.3646109Z ##[group]Run if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:29:13.3646722Z if . $PRELUDE && which conda; then collect_pytorch_env_info $BUILD_ENV; fi 2025-05-07T20:29:13.3660989Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:29:13.3661330Z env: 2025-05-07T20:29:13.3661553Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:29:13.3661849Z BUILD_ENV: build_binary 2025-05-07T20:29:13.3662092Z BUILD_TARGET: genai 2025-05-07T20:29:13.3662501Z BUILD_VARIANT: cuda 2025-05-07T20:29:13.3662735Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:29:13.3662987Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:29:13.3663285Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:29:13.3663620Z ##[endgroup] 2025-05-07T20:29:13.6995917Z /home/ec2-user/miniconda/bin/conda 2025-05-07T20:29:13.6997582Z ################################################################################ 2025-05-07T20:29:13.6998208Z # Collect PyTorch Environment Information (for Reporting Issues) 2025-05-07T20:29:13.6998577Z # 2025-05-07T20:29:13.7013320Z # [2025-05-07T20:29:13.701Z] + collect_pytorch_env_info build_binary 2025-05-07T20:29:13.7013815Z ################################################################################ 2025-05-07T20:29:13.7014132Z 2025-05-07T20:29:13.7028664Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:29:13.7955144Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:29:13.7965465Z [INFO] Downloading the PyTorch environment info collection script ... 2025-05-07T20:29:13.7966357Z + wget -q https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py 2025-05-07T20:29:13.7966866Z 2025-05-07T20:29:13.8853531Z 2025-05-07T20:29:13.8854037Z [INFO] Collecting PyTorch environment info (will be needed for reporting issues to PyTorch) ... 2025-05-07T20:29:13.8878174Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python collect_env.py 2025-05-07T20:29:19.8321706Z Collecting environment information... 
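The variant and ABI checks logged just above (correct variant cu128, _GLIBCXX_USE_CXX11_ABI True) reduce to a quick probe of the installed wheel; a minimal sketch (an assumed form; the actual checks are implemented in setup_env.bash):

    # Confirm the nightly wheel is the cu128 variant and was built with
    # the CXX11 ABI, matching the True output logged above.
    conda run -n build_binary python -c "import torch; assert torch.__version__.endswith('+cu128'), 'wrong CUDA variant'; print(torch.__version__, torch.compiled_with_cxx11_abi())"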
2025-05-07T20:29:19.8322246Z PyTorch version: 2.8.0.dev20250507+cu128 2025-05-07T20:29:19.8322654Z Is debug build: False 2025-05-07T20:29:19.8322925Z CUDA used to build PyTorch: 12.8 2025-05-07T20:29:19.8323197Z ROCM used to build PyTorch: N/A 2025-05-07T20:29:19.8323374Z 2025-05-07T20:29:19.8323474Z OS: Amazon Linux 2023.6.20250317 (x86_64) 2025-05-07T20:29:19.8323784Z GCC version: (conda-forge gcc 11.4.0-13) 11.4.0 2025-05-07T20:29:19.8324093Z Clang version: Could not collect 2025-05-07T20:29:19.8324369Z CMake version: Could not collect 2025-05-07T20:29:19.8324628Z Libc version: glibc-2.34 2025-05-07T20:29:19.8324778Z 2025-05-07T20:29:19.8325082Z Python version: 3.13.0 | packaged by conda-forge | (main, Nov 27 2024, 19:18:50) [GCC 13.3.0] (64-bit runtime) 2025-05-07T20:29:19.8325678Z Python platform: Linux-6.1.130-139.222.amzn2023.x86_64-x86_64-with-glibc2.34 2025-05-07T20:29:19.8326081Z Is CUDA available: True 2025-05-07T20:29:19.8326340Z CUDA runtime version: 12.8.61 2025-05-07T20:29:19.8326599Z CUDA_MODULE_LOADING set to: LAZY 2025-05-07T20:29:19.8326896Z GPU models and configuration: GPU 0: NVIDIA A10G 2025-05-07T20:29:19.8327228Z Nvidia driver version: 570.133.07 2025-05-07T20:29:19.8327493Z cuDNN version: Could not collect 2025-05-07T20:29:19.8327754Z HIP runtime version: N/A 2025-05-07T20:29:19.8327999Z MIOpen runtime version: N/A 2025-05-07T20:29:19.8328252Z Is XNNPACK available: True 2025-05-07T20:29:19.8328421Z 2025-05-07T20:29:19.8328495Z CPU: 2025-05-07T20:29:19.8328712Z Architecture: x86_64 2025-05-07T20:29:19.8329040Z CPU op-mode(s): 32-bit, 64-bit 2025-05-07T20:29:19.8329422Z Address sizes: 48 bits physical, 48 bits virtual 2025-05-07T20:29:19.8329799Z Byte Order: Little Endian 2025-05-07T20:29:19.8330113Z CPU(s): 16 2025-05-07T20:29:19.8330392Z On-line CPU(s) list: 0-15 2025-05-07T20:29:19.8331037Z Vendor ID: AuthenticAMD 2025-05-07T20:29:19.8331384Z Model name: AMD EPYC 7R32 2025-05-07T20:29:19.8331689Z CPU family: 23 2025-05-07T20:29:19.8331965Z Model: 49 2025-05-07T20:29:19.8332245Z Thread(s) per core: 2 2025-05-07T20:29:19.8332518Z Core(s) per socket: 8 2025-05-07T20:29:19.8332791Z Socket(s): 1 2025-05-07T20:29:19.8333061Z Stepping: 0 2025-05-07T20:29:19.8333499Z BogoMIPS: 5598.98 2025-05-07T20:29:19.8335523Z Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid 2025-05-07T20:29:19.8339693Z Hypervisor vendor: KVM 2025-05-07T20:29:19.8340003Z Virtualization type: full 2025-05-07T20:29:19.8340819Z L1d cache: 256 KiB (8 instances) 2025-05-07T20:29:19.8341278Z L1i cache: 256 KiB (8 instances) 2025-05-07T20:29:19.8341639Z L2 cache: 4 MiB (8 instances) 2025-05-07T20:29:19.8341988Z L3 cache: 32 MiB (2 instances) 2025-05-07T20:29:19.8342303Z NUMA node(s): 1 2025-05-07T20:29:19.8342585Z NUMA node0 CPU(s): 0-15 2025-05-07T20:29:19.8342931Z Vulnerability Gather data sampling: Not affected 2025-05-07T20:29:19.8343292Z Vulnerability Itlb multihit: Not affected 2025-05-07T20:29:19.8343645Z Vulnerability L1tf: Not affected 2025-05-07T20:29:19.8343995Z Vulnerability 
Mds: Not affected 2025-05-07T20:29:19.8344334Z Vulnerability Meltdown: Not affected 2025-05-07T20:29:19.8344682Z Vulnerability Mmio stale data: Not affected 2025-05-07T20:29:19.8345040Z Vulnerability Reg file data sampling: Not affected 2025-05-07T20:29:19.8345565Z Vulnerability Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection 2025-05-07T20:29:19.8346139Z Vulnerability Spec rstack overflow: Mitigation; safe RET 2025-05-07T20:29:19.8346670Z Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl 2025-05-07T20:29:19.8347350Z Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization 2025-05-07T20:29:19.8348297Z Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected 2025-05-07T20:29:19.8348957Z Vulnerability Srbds: Not affected 2025-05-07T20:29:19.8349313Z Vulnerability Tsx async abort: Not affected 2025-05-07T20:29:19.8349537Z 2025-05-07T20:29:19.8349644Z Versions of relevant libraries: 2025-05-07T20:29:19.8349902Z [pip3] numpy==2.2.5 2025-05-07T20:29:19.8350144Z [pip3] nvidia-cublas-cu12==12.8.3.14 2025-05-07T20:29:19.8350453Z [pip3] nvidia-cuda-cupti-cu12==12.8.57 2025-05-07T20:29:19.8350749Z [pip3] nvidia-cuda-nvrtc-cu12==12.8.61 2025-05-07T20:29:19.8351057Z [pip3] nvidia-cuda-runtime-cu12==12.8.57 2025-05-07T20:29:19.8351365Z [pip3] nvidia-cudnn-cu12==9.8.0.87 2025-05-07T20:29:19.8351640Z [pip3] nvidia-cufft-cu12==11.3.3.41 2025-05-07T20:29:19.8351925Z [pip3] nvidia-curand-cu12==10.3.9.55 2025-05-07T20:29:19.8352218Z [pip3] nvidia-cusolver-cu12==11.7.2.55 2025-05-07T20:29:19.8352505Z [pip3] nvidia-cusparse-cu12==12.5.7.53 2025-05-07T20:29:19.8353029Z [pip3] nvidia-cusparselt-cu12==0.6.3 2025-05-07T20:29:19.8353327Z [pip3] nvidia-nccl-cu12==2.26.2 2025-05-07T20:29:19.8353609Z [pip3] nvidia-nvjitlink-cu12==12.8.61 2025-05-07T20:29:19.8353894Z [pip3] nvidia-nvtx-cu12==12.8.55 2025-05-07T20:29:19.8354173Z [pip3] pytorch-triton==3.3.0+git96316ce5 2025-05-07T20:29:19.8354465Z [pip3] torch==2.8.0.dev20250507+cu128 2025-05-07T20:29:19.8354827Z [conda] cuda-cudart 12.8.57 h5888daf_1 conda-forge 2025-05-07T20:29:19.8355299Z [conda] cuda-cudart-dev 12.8.57 h5888daf_1 conda-forge 2025-05-07T20:29:19.8355924Z [conda] cuda-cudart-dev_linux-64 12.8.57 h3f2d84a_1 conda-forge 2025-05-07T20:29:19.8356424Z [conda] cuda-cudart-static 12.8.57 h5888daf_1 conda-forge 2025-05-07T20:29:19.8356938Z [conda] cuda-cudart-static_linux-64 12.8.57 h3f2d84a_1 conda-forge 2025-05-07T20:29:19.8357459Z [conda] cuda-cudart_linux-64 12.8.57 h3f2d84a_1 conda-forge 2025-05-07T20:29:19.8357932Z [conda] cuda-cupti 12.8.57 hbd13f7d_0 conda-forge 2025-05-07T20:29:19.8358376Z [conda] cuda-cupti-dev 12.8.57 h5888daf_0 conda-forge 2025-05-07T20:29:19.8358837Z [conda] cuda-libraries 12.8.0 ha770c72_0 conda-forge 2025-05-07T20:29:19.8359312Z [conda] cuda-libraries-dev 12.8.0 ha770c72_0 conda-forge 2025-05-07T20:29:19.8359773Z [conda] cuda-nvrtc 12.8.61 hbd13f7d_0 conda-forge 2025-05-07T20:29:19.8360222Z [conda] cuda-nvrtc-dev 12.8.61 h5888daf_0 conda-forge 2025-05-07T20:29:19.8360662Z [conda] cuda-nvtx 12.8.55 hbd13f7d_0 conda-forge 2025-05-07T20:29:19.8361100Z [conda] cuda-opencl 12.8.55 hbd13f7d_0 conda-forge 2025-05-07T20:29:19.8361551Z [conda] cuda-opencl-dev 12.8.55 h5888daf_0 conda-forge 2025-05-07T20:29:19.8362017Z [conda] cuda-runtime 12.8.0 ha804496_0 conda-forge 2025-05-07T20:29:19.8362459Z [conda] libcublas 12.8.3.14 h9ab20c4_0 conda-forge 
2025-05-07T20:29:19.8362910Z [conda] libcublas-dev 12.8.3.14 h9ab20c4_0 conda-forge 2025-05-07T20:29:19.8363352Z [conda] libcufft 11.3.3.41 hbd13f7d_0 conda-forge 2025-05-07T20:29:19.8363795Z [conda] libcufft-dev 11.3.3.41 h5888daf_0 conda-forge 2025-05-07T20:29:19.8364245Z [conda] libcurand 10.3.9.55 hbd13f7d_0 conda-forge 2025-05-07T20:29:19.8364694Z [conda] libcurand-dev 10.3.9.55 h5888daf_0 conda-forge 2025-05-07T20:29:19.8365152Z [conda] libcusolver 11.7.2.55 h9ab20c4_0 conda-forge 2025-05-07T20:29:19.8365618Z [conda] libcusolver-dev 11.7.2.55 h9ab20c4_0 conda-forge 2025-05-07T20:29:19.8366086Z [conda] libcusparse 12.5.7.53 hbd13f7d_0 conda-forge 2025-05-07T20:29:19.8366543Z [conda] libcusparse-dev 12.5.7.53 h5888daf_0 conda-forge 2025-05-07T20:29:19.8367013Z [conda] libnvjitlink 12.8.61 hbd13f7d_0 conda-forge 2025-05-07T20:29:19.8367482Z [conda] libnvjitlink-dev 12.8.61 h5888daf_0 conda-forge 2025-05-07T20:29:19.8367926Z [conda] numpy 2.2.5 py313h17eae1a_0 conda-forge 2025-05-07T20:29:19.8368381Z [conda] nvidia-cublas-cu12 12.8.3.14 pypi_0 pypi 2025-05-07T20:29:19.8368914Z [conda] nvidia-cuda-cupti-cu12 12.8.57 pypi_0 pypi 2025-05-07T20:29:19.8369392Z [conda] nvidia-cuda-nvrtc-cu12 12.8.61 pypi_0 pypi 2025-05-07T20:29:19.8369886Z [conda] nvidia-cuda-runtime-cu12 12.8.57 pypi_0 pypi 2025-05-07T20:29:19.8370360Z [conda] nvidia-cudnn-cu12 9.8.0.87 pypi_0 pypi 2025-05-07T20:29:19.8370929Z [conda] nvidia-cufft-cu12 11.3.3.41 pypi_0 pypi 2025-05-07T20:29:19.8371391Z [conda] nvidia-curand-cu12 10.3.9.55 pypi_0 pypi 2025-05-07T20:29:19.8371852Z [conda] nvidia-cusolver-cu12 11.7.2.55 pypi_0 pypi 2025-05-07T20:29:19.8372321Z [conda] nvidia-cusparse-cu12 12.5.7.53 pypi_0 pypi 2025-05-07T20:29:19.8372798Z [conda] nvidia-cusparselt-cu12 0.6.3 pypi_0 pypi 2025-05-07T20:29:19.8373265Z [conda] nvidia-nccl-cu12 2.26.2 pypi_0 pypi 2025-05-07T20:29:19.8373812Z [conda] nvidia-nvjitlink-cu12 12.8.61 pypi_0 pypi 2025-05-07T20:29:19.8374271Z [conda] nvidia-nvtx-cu12 12.8.55 pypi_0 pypi 2025-05-07T20:29:19.8374728Z [conda] pytorch-triton 3.3.0+git96316ce5 pypi_0 pypi 2025-05-07T20:29:19.8375167Z [conda] torch 2.8.0.dev20250507+cu128 pypi_0 pypi 2025-05-07T20:29:19.8375436Z 2025-05-07T20:29:19.9076737Z ##[group]Run . $PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:29:19.9077400Z . 
$PRELUDE; cd fbgemm_gpu; prepare_fbgemm_gpu_build $BUILD_ENV 2025-05-07T20:29:19.9090023Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:29:19.9090364Z env: 2025-05-07T20:29:19.9090583Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:29:19.9090866Z BUILD_ENV: build_binary 2025-05-07T20:29:19.9091105Z BUILD_TARGET: genai 2025-05-07T20:29:19.9091332Z BUILD_VARIANT: cuda 2025-05-07T20:29:19.9091580Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:29:19.9091826Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:29:19.9092123Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:29:19.9092454Z ##[endgroup] 2025-05-07T20:29:20.2483984Z ################################################################################ 2025-05-07T20:29:20.2484474Z # Prepare FBGEMM-GPU Build 2025-05-07T20:29:20.2484791Z # 2025-05-07T20:29:20.2499711Z # [2025-05-07T20:29:20.249Z] + prepare_fbgemm_gpu_build build_binary 2025-05-07T20:29:20.2500252Z ################################################################################ 2025-05-07T20:29:20.2500540Z 2025-05-07T20:29:20.2515190Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:29:20.3547675Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:29:20.3567407Z [BUILD] Running git submodules update ... 2025-05-07T20:29:20.3589023Z [EXEC] [ATTEMPT 0/3] + git submodule sync 2025-05-07T20:29:20.3953751Z Synchronizing submodule url for '../external/asmjit' 2025-05-07T20:29:20.3954213Z Synchronizing submodule url for '../external/composable_kernel' 2025-05-07T20:29:20.3954653Z Synchronizing submodule url for '../external/cpuinfo' 2025-05-07T20:29:20.3955036Z Synchronizing submodule url for '../external/cutlass' 2025-05-07T20:29:20.3955435Z Synchronizing submodule url for '../external/googletest' 2025-05-07T20:29:20.3955875Z Synchronizing submodule url for '../external/hipify_torch' 2025-05-07T20:29:20.3956285Z Synchronizing submodule url for '../external/json' 2025-05-07T20:29:20.3989463Z [EXEC] [ATTEMPT 0/3] + git submodule update --init --recursive 2025-05-07T20:29:20.4538848Z [BUILD] Installing other build dependencies ... 
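Every external command in this job goes through a retry wrapper, visible as the [EXEC] [ATTEMPT 0/3] prefix on each invocation. A hypothetical sketch of such a helper (name and back-off assumed; the real one lives in setup_env.bash):

    # Run a command, retrying up to 3 more times on failure.
    exec_with_retries () {
      local max_retries=3 attempt
      for ((attempt = 0; attempt <= max_retries; attempt++)); do
        echo "[EXEC] [ATTEMPT ${attempt}/${max_retries}] + $*"
        "$@" && return 0
        sleep 2  # brief back-off between attempts
      done
      return 1
    }
    exec_with_retries conda run --no-capture-output -n build_binary \
      python -m pip install -r requirements.txt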
2025-05-07T20:29:20.4560819Z [EXEC] [ATTEMPT 0/3] + conda run --no-capture-output -n build_binary python -m pip install -r requirements.txt 2025-05-07T20:29:22.8631458Z Collecting backports.tarfile (from -r requirements.txt (line 13)) 2025-05-07T20:29:22.8805037Z Downloading backports.tarfile-1.2.0-py3-none-any.whl.metadata (2.0 kB) 2025-05-07T20:29:22.9721894Z Collecting build (from -r requirements.txt (line 14)) 2025-05-07T20:29:23.0200511Z Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB) 2025-05-07T20:29:23.2145868Z Collecting cmake (from -r requirements.txt (line 15)) 2025-05-07T20:29:23.2176229Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.3 kB) 2025-05-07T20:29:23.3178564Z Collecting click (from -r requirements.txt (line 16)) 2025-05-07T20:29:23.3203443Z Downloading click-8.1.8-py3-none-any.whl.metadata (2.3 kB) 2025-05-07T20:29:23.6186010Z Collecting hypothesis (from -r requirements.txt (line 17)) 2025-05-07T20:29:23.6218664Z Downloading hypothesis-6.131.14-py3-none-any.whl.metadata (5.6 kB) 2025-05-07T20:29:23.6782659Z Requirement already satisfied: jinja2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from -r requirements.txt (line 18)) (3.1.4) 2025-05-07T20:29:23.6786379Z Requirement already satisfied: mpmath==1.3.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from -r requirements.txt (line 19)) (1.3.0) 2025-05-07T20:29:23.7466710Z Collecting ninja (from -r requirements.txt (line 20)) 2025-05-07T20:29:23.7495315Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl.metadata (5.0 kB) 2025-05-07T20:29:23.7999781Z Requirement already satisfied: numpy>=2.0.2 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from -r requirements.txt (line 21)) (2.2.5) 2025-05-07T20:29:23.8558934Z Collecting pyre-extensions (from -r requirements.txt (line 22)) 2025-05-07T20:29:23.8614441Z Downloading pyre_extensions-0.0.32-py3-none-any.whl.metadata (4.0 kB) 2025-05-07T20:29:23.9780353Z Collecting pyyaml (from -r requirements.txt (line 23)) 2025-05-07T20:29:23.9806605Z Downloading PyYAML-6.0.2-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB) 2025-05-07T20:29:24.0836835Z Collecting scikit-build (from -r requirements.txt (line 24)) 2025-05-07T20:29:24.0868438Z Downloading scikit_build-0.18.1-py3-none-any.whl.metadata (18 kB) 2025-05-07T20:29:24.1294995Z Requirement already satisfied: setuptools in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from -r requirements.txt (line 25)) (78.1.1) 2025-05-07T20:29:24.1900697Z Collecting setuptools_git_versioning (from -r requirements.txt (line 26)) 2025-05-07T20:29:24.1925793Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl.metadata (6.1 kB) 2025-05-07T20:29:24.2792186Z Collecting tabulate (from -r requirements.txt (line 27)) 2025-05-07T20:29:24.2824154Z Downloading tabulate-0.9.0-py3-none-any.whl.metadata (34 kB) 2025-05-07T20:29:24.3843008Z Collecting patchelf (from -r requirements.txt (line 28)) 2025-05-07T20:29:24.3880959Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl.metadata (3.5 kB) 2025-05-07T20:29:24.4961048Z Collecting packaging>=19.1 (from build->-r requirements.txt (line 14)) 2025-05-07T20:29:24.4994155Z Downloading packaging-25.0-py3-none-any.whl.metadata (3.3 kB) 2025-05-07T20:29:24.5935970Z Collecting pyproject_hooks (from build->-r requirements.txt 
(line 14)) 2025-05-07T20:29:24.5963423Z Downloading pyproject_hooks-1.2.0-py3-none-any.whl.metadata (1.3 kB) 2025-05-07T20:29:24.7012026Z Collecting attrs>=22.2.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:29:24.7037379Z Downloading attrs-25.3.0-py3-none-any.whl.metadata (10 kB) 2025-05-07T20:29:24.8136706Z Collecting sortedcontainers<3.0.0,>=2.1.0 (from hypothesis->-r requirements.txt (line 17)) 2025-05-07T20:29:24.8172229Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl.metadata (10 kB) 2025-05-07T20:29:24.8691747Z Requirement already satisfied: MarkupSafe>=2.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from jinja2->-r requirements.txt (line 18)) (2.1.5) 2025-05-07T20:29:24.9145938Z Collecting typing-inspect (from pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:29:24.9195393Z Downloading typing_inspect-0.9.0-py3-none-any.whl.metadata (1.5 kB) 2025-05-07T20:29:24.9666020Z Requirement already satisfied: typing-extensions in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from pyre-extensions->-r requirements.txt (line 22)) (4.13.2) 2025-05-07T20:29:25.0150977Z Collecting distro (from scikit-build->-r requirements.txt (line 24)) 2025-05-07T20:29:25.0178224Z Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB) 2025-05-07T20:29:25.0655210Z Requirement already satisfied: wheel>=0.32.0 in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from scikit-build->-r requirements.txt (line 24)) (0.45.1) 2025-05-07T20:29:25.1261246Z Collecting mypy-extensions>=0.3.0 (from typing-inspect->pyre-extensions->-r requirements.txt (line 22)) 2025-05-07T20:29:25.1290324Z Downloading mypy_extensions-1.1.0-py3-none-any.whl.metadata (1.1 kB) 2025-05-07T20:29:25.1754641Z Downloading backports.tarfile-1.2.0-py3-none-any.whl (30 kB) 2025-05-07T20:29:25.2249842Z Downloading build-1.2.2.post1-py3-none-any.whl (22 kB) 2025-05-07T20:29:25.2780044Z Downloading cmake-4.0.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.9 MB) 2025-05-07T20:29:25.7938162Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27.9/27.9 MB 54.0 MB/s eta 0:00:00 2025-05-07T20:29:25.7968639Z Downloading click-8.1.8-py3-none-any.whl (98 kB) 2025-05-07T20:29:25.8509437Z Downloading hypothesis-6.131.14-py3-none-any.whl (500 kB) 2025-05-07T20:29:25.9108258Z Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB) 2025-05-07T20:29:25.9603058Z Downloading ninja-1.11.1.4-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (422 kB) 2025-05-07T20:29:26.0142119Z Downloading pyre_extensions-0.0.32-py3-none-any.whl (12 kB) 2025-05-07T20:29:26.0655155Z Downloading PyYAML-6.0.2-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (759 kB) 2025-05-07T20:29:26.1216444Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 759.5/759.5 kB 9.2 MB/s eta 0:00:00 2025-05-07T20:29:26.1244281Z Downloading scikit_build-0.18.1-py3-none-any.whl (85 kB) 2025-05-07T20:29:26.1726754Z Downloading setuptools_git_versioning-2.1.0-py3-none-any.whl (10 kB) 2025-05-07T20:29:26.2248736Z Downloading tabulate-0.9.0-py3-none-any.whl (35 kB) 2025-05-07T20:29:26.2760802Z Downloading patchelf-0.17.2.2-py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.musllinux_1_1_x86_64.whl (466 kB) 2025-05-07T20:29:26.3368300Z Downloading attrs-25.3.0-py3-none-any.whl (63 kB) 2025-05-07T20:29:26.3851413Z Downloading packaging-25.0-py3-none-any.whl (66 kB) 2025-05-07T20:29:26.4283576Z Downloading distro-1.9.0-py3-none-any.whl (20 kB) 2025-05-07T20:29:26.4823130Z Downloading 
pyproject_hooks-1.2.0-py3-none-any.whl (10 kB) 2025-05-07T20:29:26.5311749Z Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB) 2025-05-07T20:29:26.5805670Z Downloading mypy_extensions-1.1.0-py3-none-any.whl (5.0 kB) 2025-05-07T20:29:26.7467840Z Installing collected packages: sortedcontainers, tabulate, pyyaml, pyproject_hooks, patchelf, packaging, ninja, mypy-extensions, distro, cmake, click, backports.tarfile, attrs, typing-inspect, setuptools_git_versioning, scikit-build, hypothesis, build, pyre-extensions 2025-05-07T20:29:28.9943768Z 2025-05-07T20:29:28.9994395Z Successfully installed attrs-25.3.0 backports.tarfile-1.2.0 build-1.2.2.post1 click-8.1.8 cmake-4.0.0 distro-1.9.0 hypothesis-6.131.14 mypy-extensions-1.1.0 ninja-1.11.1.4 packaging-25.0 patchelf-0.17.2.2 pyproject_hooks-1.2.0 pyre-extensions-0.0.32 pyyaml-6.0.2 scikit-build-0.18.1 setuptools_git_versioning-2.1.0 sortedcontainers-2.4.0 tabulate-0.9.0 typing-inspect-0.9.0 2025-05-07T20:29:29.1737657Z ################################################################################ 2025-05-07T20:29:29.1738005Z # Install PyTorch (PyTorch PIP) 2025-05-07T20:29:29.1738420Z # 2025-05-07T20:29:29.1755763Z # [2025-05-07T20:29:29.175Z] + install_triton_pip build_binary 2025-05-07T20:29:29.1756144Z ################################################################################ 2025-05-07T20:29:29.1756384Z 2025-05-07T20:29:29.1756605Z [BUILD] Installing pytorch-triton nightly/3.2.0+git4b3bb1f8 from PIP ... 2025-05-07T20:29:29.1757036Z ################################################################################ 2025-05-07T20:29:29.1757386Z # Install Package From PyTorch PIP: pytorch-triton 2025-05-07T20:29:29.1757687Z # 2025-05-07T20:29:29.1772386Z # [2025-05-07T20:29:29.176Z] + install_from_pytorch_pip build_binary pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:29:29.1772899Z ################################################################################ 2025-05-07T20:29:29.1773108Z 2025-05-07T20:29:29.1787885Z [EXEC] [ATTEMPT 0/3] + wget -q --timeout 1 pypi.org -O /dev/null 2025-05-07T20:29:29.2680092Z [CHECK] Network does not appear to be blocked. 2025-05-07T20:29:29.2680427Z ################################################################################ 2025-05-07T20:29:29.2680753Z # Prepare PIP Arguments (PyTorch PIP) 2025-05-07T20:29:29.2681268Z # 2025-05-07T20:29:29.2700589Z # [2025-05-07T20:29:29.269Z] + __prepare_pip_arguments pytorch-triton nightly/3.2.0+git4b3bb1f8 2025-05-07T20:29:29.2701228Z ################################################################################ 2025-05-07T20:29:29.2701508Z 2025-05-07T20:29:29.2750584Z [INSTALL] Extracted package (channel, version): (nightly, 3.2.0+git4b3bb1f8) 2025-05-07T20:29:29.2767238Z [INSTALL] Using a non-RELEASE channel: nightly ... 2025-05-07T20:29:29.2767806Z [INSTALL] Extracted the full PIP channel: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:29:29.2776059Z [INSTALL] Extracted the full PIP package: --pre pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:29:29.2785488Z [INSTALL] Attempting to install [pytorch-triton, 3.2.0+git4b3bb1f8] from PyTorch PIP using channel https://download.pytorch.org/whl/nightly/ ... 2025-05-07T20:29:29.2806690Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary pip install --pre pytorch-triton==3.2.0+git4b3bb1f8 --index-url https://download.pytorch.org/whl/nightly/ 2025-05-07T20:29:36.8607823Z ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. 
This behaviour is the source of the following dependency conflicts. 2025-05-07T20:29:36.8609062Z torch 2.8.0.dev20250507+cu128 requires pytorch-triton==3.3.0+git96316ce5; platform_system == "Linux", but you have pytorch-triton 3.2.0+git4b3bb1f8 which is incompatible. 2025-05-07T20:29:36.8609730Z 2025-05-07T20:29:36.8609958Z Looking in indexes: https://download.pytorch.org/whl/nightly/ 2025-05-07T20:29:36.8610361Z Collecting pytorch-triton==3.2.0+git4b3bb1f8 2025-05-07T20:29:36.8611188Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.3 kB) 2025-05-07T20:29:36.8612410Z Downloading https://download.pytorch.org/whl/nightly/pytorch_triton-3.2.0%2Bgit4b3bb1f8-cp313-cp313-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (166.5 MB) 2025-05-07T20:29:36.8613504Z ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 166.5/166.5 MB 58.2 MB/s eta 0:00:00 2025-05-07T20:29:36.8613902Z Installing collected packages: pytorch-triton 2025-05-07T20:29:36.8614246Z Attempting uninstall: pytorch-triton 2025-05-07T20:29:36.8614629Z Found existing installation: pytorch-triton 3.3.0+git96316ce5 2025-05-07T20:29:36.8615045Z Uninstalling pytorch-triton-3.3.0+git96316ce5: 2025-05-07T20:29:36.8615479Z Successfully uninstalled pytorch-triton-3.3.0+git96316ce5 2025-05-07T20:29:36.8615920Z Successfully installed pytorch-triton-3.2.0+git4b3bb1f8 2025-05-07T20:29:36.8616170Z 2025-05-07T20:29:39.1002995Z [CHECK] Python (sub-)package 'triton' found ... 2025-05-07T20:29:39.1006892Z [CHECK] Printing out the pytorch-triton version ... 2025-05-07T20:29:41.2603730Z ################################################################################ 2025-05-07T20:29:41.2604310Z [CHECK] The installed VERSION of pytorch-triton is: 3.2.0 2025-05-07T20:29:41.2604838Z ################################################################################ 2025-05-07T20:29:41.2605127Z 2025-05-07T20:29:43.3325487Z [CHECK] Python (sub-)package 'numpy' found ... 2025-05-07T20:29:45.5033532Z [CHECK] Python (sub-)package 'skbuild' found ... 2025-05-07T20:29:45.5037598Z [BUILD] Successfully ran git submodules update 2025-05-07T20:29:45.5083171Z ##[group]Run . $PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:29:45.5083639Z . 
$PRELUDE; install_fbgemm_gpu_wheel $BUILD_ENV *.whl 2025-05-07T20:29:45.5098022Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:29:45.5098368Z env: 2025-05-07T20:29:45.5098589Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:29:45.5098878Z BUILD_ENV: build_binary 2025-05-07T20:29:45.5099120Z BUILD_TARGET: genai 2025-05-07T20:29:45.5099341Z BUILD_VARIANT: cuda 2025-05-07T20:29:45.5099563Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:29:45.5099814Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:29:45.5100355Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:29:45.5100686Z ##[endgroup] 2025-05-07T20:29:45.8485840Z ################################################################################ 2025-05-07T20:29:45.8486331Z # Install FBGEMM-GPU from Wheel 2025-05-07T20:29:45.8486691Z # 2025-05-07T20:29:45.8502465Z # [2025-05-07T20:29:45.849Z] + install_fbgemm_gpu_wheel build_binary fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:45.8503126Z ################################################################################ 2025-05-07T20:29:45.8503344Z 2025-05-07T20:29:45.8503720Z [INSTALL] Printing out FBGEMM-GPU wheel SHA: fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:45.8504481Z + sha1sum fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:45.8504816Z 2025-05-07T20:29:45.8662877Z e6e36b113f85d3aaa465a028688a068480db398f fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:45.8665322Z 2025-05-07T20:29:45.8665932Z + sha256sum fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:45.8666285Z 2025-05-07T20:29:45.8851444Z ad0b4412d9939ed191fe39ed235330a3031fd537afb4dd426cb9ce0834b66e07 fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:45.8854031Z 2025-05-07T20:29:45.8854347Z + md5sum fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:45.8854681Z 2025-05-07T20:29:45.9183654Z 74a01928743b9ea024408833cc9e2c10 fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:45.9187063Z 2025-05-07T20:29:45.9198532Z [INSTALL] Installing FBGEMM-GPU wheel: fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl ... 2025-05-07T20:29:45.9220147Z [EXEC] [ATTEMPT 0/3] + conda run -n build_binary python -m pip install fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:48.7133749Z Processing ./fbgemm_gpu_genai_nightly-2025.5.7-cp313-cp313-manylinux_2_28_x86_64.whl 2025-05-07T20:29:48.7134716Z Requirement already satisfied: numpy in /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages (from fbgemm-gpu-genai-nightly==2025.5.7) (2.2.5) 2025-05-07T20:29:48.7135598Z Installing collected packages: fbgemm-gpu-genai-nightly 2025-05-07T20:29:48.7136033Z Successfully installed fbgemm-gpu-genai-nightly-2025.5.7 2025-05-07T20:29:48.7136293Z 2025-05-07T20:29:55.6313119Z ################################################################################ 2025-05-07T20:29:55.6313538Z [CHECK] !!!! INFO !!!! 
2025-05-07T20:29:55.6313910Z [CHECK] The installed version of PyTorch is: 2.8.0.dev20250507+cu128
2025-05-07T20:29:55.6314323Z [CHECK] CUDA version reported by PyTorch is: 12.8
2025-05-07T20:29:55.6314640Z [CHECK]
2025-05-07T20:29:55.6314955Z [CHECK] NOTE: If the PyTorch package channel is different from the FBGEMM_GPU
2025-05-07T20:29:55.6315456Z [CHECK] package channel, the package may be broken at runtime!!!
2025-05-07T20:29:55.6315869Z ################################################################################
2025-05-07T20:29:55.6316089Z
2025-05-07T20:29:55.6316203Z [INSTALL] Checking imports and symbols ...
2025-05-07T20:29:59.6502215Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ...
2025-05-07T20:30:03.6490469Z [CHECK] Found symbol '__version__' in Python package 'fbgemm_gpu'.
2025-05-07T20:30:07.6389346Z [CHECK] Found symbol '__variant__' in Python package 'fbgemm_gpu'.
2025-05-07T20:30:07.6393026Z [CHECK] Printing out the FBGEMM-GPU version ...
2025-05-07T20:30:19.6441718Z ################################################################################
2025-05-07T20:30:19.6442297Z [CHECK] The installed FBGEMM TARGET is: genai
2025-05-07T20:30:19.6442655Z [CHECK] The installed FBGEMM VARIANT is: cuda
2025-05-07T20:30:19.6442989Z [CHECK] The installed FBGEMM VERSION is: 2025.5.7
2025-05-07T20:30:19.6443316Z ################################################################################
2025-05-07T20:30:19.6444015Z
2025-05-07T20:30:27.6281736Z ################################################################################
2025-05-07T20:30:27.6282332Z [CHECK] FBGEMM_GPU Experimental Packages
2025-05-07T20:30:27.6283731Z [CHECK] fbgemm_gpu: ['__annotations__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__target__', '__variant__', '__version__', '_load_library', 'docs', 'fbgemm_genai_libraries', 'fbgemm_gpu', 'fbgemm_gpu_libraries', 'libraries_to_load', 'library', 'logging', 'open_source', 'os', 'split_embedding_configs', 'split_table_batched_embeddings_ops_common', 'torch', 'utils']
2025-05-07T20:30:27.6285306Z [CHECK] fbgemm_gpu.experimental: ['__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__']
2025-05-07T20:30:27.6285838Z ################################################################################
2025-05-07T20:30:27.6286051Z
2025-05-07T20:30:27.6286223Z [INSTALL] Check for installation of Python sources ...
2025-05-07T20:30:31.6590486Z [CHECK] Python (sub-)package 'fbgemm_gpu.config' found ...
2025-05-07T20:30:35.6533067Z [CHECK] Python (sub-)package 'fbgemm_gpu.docs' found ...
2025-05-07T20:30:39.7828703Z [CHECK] Python (sub-)package 'fbgemm_gpu.quantize' found ...
2025-05-07T20:30:43.7996516Z [CHECK] Python (sub-)package 'fbgemm_gpu.tbe.cache' found ...
2025-05-07T20:30:43.8000973Z [INSTALL] Check for operator registrations ...
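(Aside: the operator-registration checks that follow can be reproduced by hand. The snippet below is a minimal sketch assuming the wheel is installed in the active Python environment; the op_is_registered helper is illustrative and not part of the CI scripts.)

    import torch
    import fbgemm_gpu  # importing the package loads the FBGEMM operator libraries

    def op_is_registered(qualname: str) -> bool:
        # qualname is e.g. "fbgemm.nccl_init"; torch.ops resolves operators
        # lazily, so a hasattr() probe is enough to test for registration.
        namespace, name = qualname.split(".")
        return hasattr(getattr(torch.ops, namespace), name)

    for op in ("fbgemm.nccl_init", "fbgemm.gqa_attn_splitk", "fbgemm.rope_qkv_decoding"):
        print(op, op_is_registered(op))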
2025-05-07T20:30:47.7104052Z fbgemm.nccl_init 2025-05-07T20:30:47.7106070Z 2025-05-07T20:30:47.7725546Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.nccl_init 2025-05-07T20:30:51.6801900Z fbgemm.gqa_attn_splitk 2025-05-07T20:30:51.6802115Z 2025-05-07T20:30:51.7416790Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.gqa_attn_splitk 2025-05-07T20:30:55.6618710Z fbgemm.rope_qkv_decoding 2025-05-07T20:30:55.6618925Z 2025-05-07T20:30:55.7239131Z [CHECK] FBGEMM_GPU operator appears to be correctly registered: torch.ops.fbgemm.rope_qkv_decoding 2025-05-07T20:30:55.7239721Z [INSTALL] FBGEMM-GPU installation through wheel completed ... 2025-05-07T20:30:55.7278263Z ##[group]Run . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:30:55.7278706Z . $PRELUDE; test_all_fbgemm_gpu_modules $BUILD_ENV 2025-05-07T20:30:55.7292524Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:30:55.7292870Z env: 2025-05-07T20:30:55.7293100Z PRELUDE: .github/scripts/setup_env.bash 2025-05-07T20:30:55.7293394Z BUILD_ENV: build_binary 2025-05-07T20:30:55.7293647Z BUILD_TARGET: genai 2025-05-07T20:30:55.7293876Z BUILD_VARIANT: cuda 2025-05-07T20:30:55.7294105Z BUILD_CUDA_VERSION: 12.8.0 2025-05-07T20:30:55.7294367Z ENFORCE_CUDA_DEVICE: 1 2025-05-07T20:30:55.7294674Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-05-07T20:30:55.7294998Z ##[endgroup] 2025-05-07T20:30:56.0639613Z ################################################################################ 2025-05-07T20:30:56.0640008Z # Test All FBGEMM-GPU Modules 2025-05-07T20:30:56.0640542Z # 2025-05-07T20:30:56.0657218Z # [2025-05-07T20:30:56.065Z] + test_all_fbgemm_gpu_modules build_binary 2025-05-07T20:30:56.0657623Z ################################################################################ 2025-05-07T20:30:56.0657831Z 2025-05-07T20:31:04.0404547Z [TEST] Determined FBGEMM_GPU (target : variant) from installation: (genai : cuda) 2025-05-07T20:31:04.0405091Z [TEST] Will be running tests specific to this target and variant ... 2025-05-07T20:31:04.0405482Z [TEST] Determined the test directories: 2025-05-07T20:31:04.0405796Z fbgemm_gpu/experimental/gen_ai/test 2025-05-07T20:31:04.0406094Z fbgemm_gpu/experimental/example/test 2025-05-07T20:31:04.0406380Z fbgemm_gpu/experimental/gemm/test 2025-05-07T20:31:04.0406566Z 2025-05-07T20:31:04.0416480Z [TEST] FBGEMM_GPU variant is cuda; configuring for CUDA-based testing ... 2025-05-07T20:31:04.0423123Z [TEST] Set environment variables for CUDA testing ... 2025-05-07T20:31:04.0423990Z + conda env config vars unset -n build_binary CUDA_VISIBLE_DEVICES 2025-05-07T20:31:04.0424279Z 2025-05-07T20:31:04.4634126Z 2025-05-07T20:31:04.4634452Z [TEST] Installing PyTest ... 
2025-05-07T20:31:04.4658858Z [EXEC] [ATTEMPT 0/3] + conda install -n build_binary -c conda-forge --override-channels -y pytest expecttest
2025-05-07T20:31:05.5654506Z Channels:
2025-05-07T20:31:05.5654796Z  - conda-forge
2025-05-07T20:31:05.5655018Z Platform: linux-64
2025-05-07T20:31:08.8996716Z Collecting package metadata (repodata.json): done
2025-05-07T20:31:10.0496087Z Solving environment: done
2025-05-07T20:31:10.2780583Z
2025-05-07T20:31:10.2781094Z ## Package Plan ##
2025-05-07T20:31:10.2781342Z
2025-05-07T20:31:10.2781646Z   environment location: /home/ec2-user/miniconda/envs/build_binary
2025-05-07T20:31:10.2782074Z
2025-05-07T20:31:10.2782176Z   added / updated specs:
2025-05-07T20:31:10.2782413Z     - expecttest
2025-05-07T20:31:10.2782652Z     - pytest
2025-05-07T20:31:10.2782768Z
2025-05-07T20:31:10.2782897Z The following packages will be downloaded:
2025-05-07T20:31:10.2783114Z
2025-05-07T20:31:10.2783230Z     package                    |            build
2025-05-07T20:31:10.2783540Z     ---------------------------|-----------------
2025-05-07T20:31:10.2783902Z     colorama-0.4.6             |     pyhd8ed1ab_1          26 KB  conda-forge
2025-05-07T20:31:10.2784455Z     exceptiongroup-1.2.2       |     pyhd8ed1ab_1          20 KB  conda-forge
2025-05-07T20:31:10.2785088Z     expecttest-0.3.0           |     pyhd8ed1ab_0          14 KB  conda-forge
2025-05-07T20:31:10.2785517Z     iniconfig-2.0.0            |     pyhd8ed1ab_1          11 KB  conda-forge
2025-05-07T20:31:10.2786113Z     packaging-25.0             |     pyh29332c3_1          61 KB  conda-forge
2025-05-07T20:31:10.2786691Z     pluggy-1.5.0               |     pyhd8ed1ab_1          23 KB  conda-forge
2025-05-07T20:31:10.2787236Z     pytest-8.3.5               |     pyhd8ed1ab_0         254 KB  conda-forge
2025-05-07T20:31:10.2788127Z     tomli-2.2.1                |     pyhd8ed1ab_1          19 KB  conda-forge
2025-05-07T20:31:10.2788507Z     ------------------------------------------------------------
2025-05-07T20:31:10.2788839Z                                            Total:        428 KB
2025-05-07T20:31:10.2789039Z
2025-05-07T20:31:10.2789160Z The following NEW packages will be INSTALLED:
2025-05-07T20:31:10.2789375Z
2025-05-07T20:31:10.2789565Z   colorama           conda-forge/noarch::colorama-0.4.6-pyhd8ed1ab_1
2025-05-07T20:31:10.2790057Z   exceptiongroup     conda-forge/noarch::exceptiongroup-1.2.2-pyhd8ed1ab_1
2025-05-07T20:31:10.2790580Z   expecttest         conda-forge/noarch::expecttest-0.3.0-pyhd8ed1ab_0
2025-05-07T20:31:10.2791033Z   iniconfig          conda-forge/noarch::iniconfig-2.0.0-pyhd8ed1ab_1
2025-05-07T20:31:10.2791481Z   packaging          conda-forge/noarch::packaging-25.0-pyh29332c3_1
2025-05-07T20:31:10.2791934Z   pluggy             conda-forge/noarch::pluggy-1.5.0-pyhd8ed1ab_1
2025-05-07T20:31:10.2792353Z   pytest             conda-forge/noarch::pytest-8.3.5-pyhd8ed1ab_0
2025-05-07T20:31:10.2792752Z   tomli              conda-forge/noarch::tomli-2.2.1-pyhd8ed1ab_1
2025-05-07T20:31:10.2793007Z
2025-05-07T20:31:10.2793153Z Downloading and Extracting Packages: ...working...
2025-05-07T20:31:10.6782744Z tomli-2.2.1 | 19 KB | ########## | 100%
2025-05-07T20:31:10.6805106Z pluggy-1.5.0 | 23 KB | ########## | 100%
2025-05-07T20:31:10.6999885Z expecttest-0.3.0 | 14 KB | ########## | 100%
2025-05-07T20:31:10.7153401Z iniconfig-2.0.0 | 11 KB | ########## | 100%
2025-05-07T20:31:10.7476124Z colorama-0.4.6 | 26 KB | ########## | 100%
2025-05-07T20:31:10.7549935Z packaging-25.0 | 61 KB | ########## | 100%
2025-05-07T20:31:10.7643098Z exceptiongroup-1.2.2 | 20 KB | ########## | 100%
2025-05-07T20:31:10.7650942Z pytest-8.3.5 | 254 KB | ########## | 100%
2025-05-07T20:31:10.7655033Z done
2025-05-07T20:31:10.8665002Z Preparing transaction: done
2025-05-07T20:31:10.9670318Z Verifying transaction: done
2025-05-07T20:31:12.8697087Z Executing transaction: done
2025-05-07T20:31:12.9974241Z [TEST] Checking imports ...
2025-05-07T20:31:16.9416544Z [CHECK] Python (sub-)package 'fbgemm_gpu' found ...
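(Aside: the "[CHECK] Python (sub-)package ... found" probes above and below amount to import checks; a minimal sketch of one, using only the standard library, is shown here. The package_found helper is illustrative, not the script's actual implementation.)

    import importlib.util

    def package_found(name: str) -> bool:
        # Locate the (sub-)package on the import path without fully importing
        # it; for a dotted name, parent packages are imported as a side effect.
        return importlib.util.find_spec(name) is not None

    print(package_found("fbgemm_gpu"))         # True once the wheel is installed
    print(package_found("fbgemm_gpu.config"))  # sub-packages are probed the same way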
2025-05-07T20:31:16.9429589Z [TEST] Setting feature flags ... 2025-05-07T20:31:16.9430007Z + conda env config vars set -n build_binary FBGEMM_TBE_ENSEMBLE_ROWWISE_ADAGRAD=1 2025-05-07T20:31:16.9430693Z 2025-05-07T20:31:17.3638730Z 2025-05-07T20:31:17.3639333Z [TEST] PyTest args: -v -rsx -s -W ignore::pytest.PytestCollectionWarning 2025-05-07T20:31:17.3641359Z ################################################################################ 2025-05-07T20:31:17.3641673Z # Run FBGEMM-GPU Tests: 2025-05-07T20:31:17.3641905Z # 2025-05-07T20:31:17.3661272Z # [2025-05-07T20:31:17.365Z] + __run_fbgemm_gpu_tests_in_directory build_binary 2025-05-07T20:31:17.3661678Z ################################################################################ 2025-05-07T20:31:17.3661886Z 2025-05-07T20:31:17.3669203Z [TEST] Enumerating ALL test files ... 2025-05-07T20:31:17.3698234Z ./attention/gqa_test.py 2025-05-07T20:31:17.3698499Z ./coalesce/coalesce_test.py 2025-05-07T20:31:17.3698766Z ./comm/multi_gpu_car_test.py 2025-05-07T20:31:17.3699027Z ./gather_scatter/gather_scatter_test.py 2025-05-07T20:31:17.3699315Z ./kv_cache/kv_cache_test.py 2025-05-07T20:31:17.3699564Z ./moe/activation_test.py 2025-05-07T20:31:17.3699818Z ./moe/gather_scatter_test.py 2025-05-07T20:31:17.3700062Z ./moe/layers_test.py 2025-05-07T20:31:17.3700288Z ./moe/shuffling_test.py 2025-05-07T20:31:17.3700521Z ./quantize/quantize_test.py 2025-05-07T20:31:17.3700683Z 2025-05-07T20:31:17.3700794Z [TEST] Enumerating IGNORED test files ... 2025-05-07T20:31:17.3701002Z 2025-05-07T20:31:17.3719424Z ################################################################################ 2025-05-07T20:31:17.3735456Z # [2025-05-07T20:31:17.373Z] Run Python Test Suite: 2025-05-07T20:31:17.3735782Z # ./attention/gqa_test.py 2025-05-07T20:31:17.3736058Z ################################################################################ 2025-05-07T20:31:17.3761379Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./attention/gqa_test.py 2025-05-07T20:31:17.3761995Z 2025-05-07T20:31:19.9113672Z ============================= test session starts ============================== 2025-05-07T20:31:19.9114562Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:31:19.9115399Z cachedir: .pytest_cache 2025-05-07T20:31:19.9115971Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:31:19.9116676Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:31:19.9117096Z plugins: hypothesis-6.131.14 2025-05-07T20:31:21.5097541Z collecting ... 
collected 2 items 2025-05-07T20:31:21.5097835Z 2025-05-07T20:31:59.0796049Z attention/gqa_test.py::Int4GQATest::test_gqa Trying example: test_gqa( 2025-05-07T20:31:59.0796649Z self=, 2025-05-07T20:31:59.0797066Z int4_kv=False, 2025-05-07T20:31:59.0797319Z num_groups=1, 2025-05-07T20:31:59.0797566Z B=1, 2025-05-07T20:31:59.0799308Z MAX_T=4, 2025-05-07T20:31:59.0799592Z N_H_L=1, 2025-05-07T20:31:59.0799849Z ) 2025-05-07T20:31:59.0800085Z Trying example: test_gqa( 2025-05-07T20:31:59.0800444Z self=, 2025-05-07T20:31:59.0800838Z int4_kv=True, 2025-05-07T20:31:59.0801082Z num_groups=1, 2025-05-07T20:31:59.0801326Z B=1, 2025-05-07T20:31:59.0801546Z MAX_T=4, 2025-05-07T20:31:59.0801767Z N_H_L=1, 2025-05-07T20:31:59.0801991Z ) 2025-05-07T20:31:59.0802217Z Trying example: test_gqa( 2025-05-07T20:31:59.0802556Z self=, 2025-05-07T20:31:59.0802930Z int4_kv=True, 2025-05-07T20:31:59.0803174Z num_groups=4, 2025-05-07T20:31:59.0803411Z B=23, 2025-05-07T20:31:59.0803643Z MAX_T=33, 2025-05-07T20:31:59.0803875Z N_H_L=68, 2025-05-07T20:31:59.0804093Z ) 2025-05-07T20:31:59.0804327Z Trying example: test_gqa( 2025-05-07T20:31:59.0804668Z self=, 2025-05-07T20:31:59.0805030Z int4_kv=True, 2025-05-07T20:31:59.0805281Z num_groups=4, 2025-05-07T20:31:59.0805931Z B=77, 2025-05-07T20:31:59.0806147Z MAX_T=4, 2025-05-07T20:31:59.0806378Z N_H_L=1, 2025-05-07T20:31:59.0806604Z ) 2025-05-07T20:31:59.0806836Z Trying example: test_gqa( 2025-05-07T20:31:59.0807184Z self=, 2025-05-07T20:31:59.0807555Z int4_kv=True, 2025-05-07T20:31:59.0807793Z num_groups=4, 2025-05-07T20:31:59.0808038Z B=77, 2025-05-07T20:31:59.0808259Z MAX_T=52, 2025-05-07T20:31:59.0808484Z N_H_L=67, 2025-05-07T20:31:59.0808709Z ) 2025-05-07T20:31:59.0808935Z Trying example: test_gqa( 2025-05-07T20:31:59.0809275Z self=, 2025-05-07T20:31:59.0809646Z int4_kv=False, 2025-05-07T20:31:59.0809893Z num_groups=4, 2025-05-07T20:31:59.0810136Z B=57, 2025-05-07T20:31:59.0810351Z MAX_T=45, 2025-05-07T20:31:59.0810585Z N_H_L=120, 2025-05-07T20:31:59.0810823Z ) 2025-05-07T20:31:59.0811049Z Trying example: test_gqa( 2025-05-07T20:31:59.0811393Z self=, 2025-05-07T20:31:59.0811778Z int4_kv=True, 2025-05-07T20:31:59.0812027Z num_groups=4, 2025-05-07T20:31:59.0812278Z B=52, 2025-05-07T20:31:59.0812509Z MAX_T=42, 2025-05-07T20:31:59.0812733Z N_H_L=53, 2025-05-07T20:31:59.0813026Z ) 2025-05-07T20:31:59.0813261Z Trying example: test_gqa( 2025-05-07T20:31:59.0813601Z self=, 2025-05-07T20:31:59.0813974Z int4_kv=True, 2025-05-07T20:31:59.0814232Z num_groups=1, 2025-05-07T20:31:59.0814473Z B=77, 2025-05-07T20:31:59.0814695Z MAX_T=95, 2025-05-07T20:31:59.0814921Z N_H_L=53, 2025-05-07T20:31:59.0815140Z ) 2025-05-07T20:31:59.0815367Z Trying example: test_gqa( 2025-05-07T20:31:59.0815710Z self=, 2025-05-07T20:31:59.0816077Z int4_kv=True, 2025-05-07T20:31:59.0816318Z num_groups=4, 2025-05-07T20:31:59.0816560Z B=113, 2025-05-07T20:31:59.0816784Z MAX_T=48, 2025-05-07T20:31:59.0817013Z N_H_L=96, 2025-05-07T20:31:59.0817290Z ) 2025-05-07T20:31:59.0817526Z Trying example: test_gqa( 2025-05-07T20:31:59.0817862Z self=, 2025-05-07T20:31:59.0818443Z int4_kv=False, 2025-05-07T20:31:59.0818700Z num_groups=1, 2025-05-07T20:31:59.0818937Z B=51, 2025-05-07T20:31:59.0819162Z MAX_T=61, 2025-05-07T20:31:59.0819397Z N_H_L=69, 2025-05-07T20:31:59.0819621Z ) 2025-05-07T20:31:59.0819856Z Trying example: test_gqa( 2025-05-07T20:31:59.0820200Z self=, 2025-05-07T20:31:59.0820566Z int4_kv=False, 2025-05-07T20:31:59.0820822Z num_groups=4, 2025-05-07T20:31:59.0821066Z B=17, 2025-05-07T20:31:59.0821280Z MAX_T=113, 
2025-05-07T20:31:59.0821516Z N_H_L=65, 2025-05-07T20:31:59.0821740Z ) 2025-05-07T20:31:59.0821957Z Trying example: test_gqa( 2025-05-07T20:31:59.0822299Z self=, 2025-05-07T20:31:59.0822671Z int4_kv=False, 2025-05-07T20:31:59.0822918Z num_groups=4, 2025-05-07T20:31:59.0823165Z B=17, 2025-05-07T20:31:59.0823386Z MAX_T=65, 2025-05-07T20:31:59.0823618Z N_H_L=65, 2025-05-07T20:31:59.0823853Z ) 2025-05-07T20:31:59.0824092Z Trying example: test_gqa( 2025-05-07T20:31:59.0824428Z self=, 2025-05-07T20:31:59.0824810Z int4_kv=False, 2025-05-07T20:31:59.0825066Z num_groups=4, 2025-05-07T20:31:59.0825319Z B=65, 2025-05-07T20:31:59.0825533Z MAX_T=65, 2025-05-07T20:31:59.0825762Z N_H_L=65, 2025-05-07T20:31:59.0825986Z ) 2025-05-07T20:31:59.0826202Z Trying example: test_gqa( 2025-05-07T20:31:59.0826539Z self=, 2025-05-07T20:31:59.0826912Z int4_kv=False, 2025-05-07T20:31:59.0827152Z num_groups=1, 2025-05-07T20:31:59.0827402Z B=6, 2025-05-07T20:31:59.0827745Z MAX_T=108, 2025-05-07T20:31:59.0827970Z N_H_L=14, 2025-05-07T20:31:59.0828194Z ) 2025-05-07T20:31:59.0828422Z Trying example: test_gqa( 2025-05-07T20:31:59.0828855Z self=, 2025-05-07T20:31:59.0829230Z int4_kv=False, 2025-05-07T20:31:59.0829481Z num_groups=1, 2025-05-07T20:31:59.0829721Z B=6, 2025-05-07T20:31:59.0829946Z MAX_T=14, 2025-05-07T20:31:59.0830182Z N_H_L=14, 2025-05-07T20:31:59.0830400Z ) 2025-05-07T20:31:59.0830628Z Trying example: test_gqa( 2025-05-07T20:31:59.0830974Z self=, 2025-05-07T20:31:59.0831340Z int4_kv=False, 2025-05-07T20:31:59.0831598Z num_groups=1, 2025-05-07T20:31:59.0831842Z B=6, 2025-05-07T20:31:59.0832057Z MAX_T=6, 2025-05-07T20:31:59.0832291Z N_H_L=14, 2025-05-07T20:31:59.0832520Z ) 2025-05-07T20:31:59.0832743Z Trying example: test_gqa( 2025-05-07T20:31:59.0833084Z self=, 2025-05-07T20:31:59.0833453Z int4_kv=False, 2025-05-07T20:31:59.0833704Z num_groups=1, 2025-05-07T20:31:59.0833943Z B=6, 2025-05-07T20:31:59.0834158Z MAX_T=6, 2025-05-07T20:31:59.0834397Z N_H_L=6, 2025-05-07T20:31:59.0834611Z ) 2025-05-07T20:31:59.0834836Z Trying example: test_gqa( 2025-05-07T20:31:59.0835183Z self=, 2025-05-07T20:31:59.0835549Z int4_kv=False, 2025-05-07T20:31:59.0835796Z num_groups=1, 2025-05-07T20:31:59.0836038Z B=70, 2025-05-07T20:31:59.0836255Z MAX_T=94, 2025-05-07T20:31:59.0836490Z N_H_L=78, 2025-05-07T20:31:59.0836720Z ) 2025-05-07T20:31:59.0836943Z Trying example: test_gqa( 2025-05-07T20:31:59.0837289Z self=, 2025-05-07T20:31:59.0837662Z int4_kv=False, 2025-05-07T20:31:59.0837901Z num_groups=1, 2025-05-07T20:31:59.0838147Z B=78, 2025-05-07T20:31:59.0838378Z MAX_T=94, 2025-05-07T20:31:59.0838600Z N_H_L=78, 2025-05-07T20:31:59.0838827Z ) 2025-05-07T20:31:59.0839056Z Trying example: test_gqa( 2025-05-07T20:31:59.0839391Z self=, 2025-05-07T20:31:59.0839770Z int4_kv=False, 2025-05-07T20:31:59.0840031Z num_groups=1, 2025-05-07T20:31:59.0840533Z B=94, 2025-05-07T20:31:59.0840754Z MAX_T=94, 2025-05-07T20:31:59.0840983Z N_H_L=78, 2025-05-07T20:31:59.0841355Z ) 2025-05-07T20:31:59.0841586Z Trying example: test_gqa( 2025-05-07T20:31:59.0841924Z self=, 2025-05-07T20:31:59.0842290Z int4_kv=False, 2025-05-07T20:31:59.0842541Z num_groups=1, 2025-05-07T20:31:59.0842778Z B=94, 2025-05-07T20:31:59.0843017Z MAX_T=94, 2025-05-07T20:31:59.0843237Z N_H_L=94, 2025-05-07T20:31:59.0843610Z ) 2025-05-07T20:31:59.0844030Z Trying example: test_gqa( 2025-05-07T20:31:59.0844471Z self=, 2025-05-07T20:31:59.0844950Z int4_kv=False, 2025-05-07T20:31:59.0854367Z num_groups=4, 2025-05-07T20:31:59.0854600Z B=41, 2025-05-07T20:31:59.0854784Z MAX_T=105, 
2025-05-07T20:31:59.0854991Z N_H_L=126, 2025-05-07T20:31:59.0855189Z ) 2025-05-07T20:31:59.0855381Z Trying example: test_gqa( 2025-05-07T20:31:59.0855684Z self=, 2025-05-07T20:31:59.0855995Z int4_kv=False, 2025-05-07T20:31:59.0856197Z num_groups=4, 2025-05-07T20:31:59.0856401Z B=105, 2025-05-07T20:31:59.0856585Z MAX_T=105, 2025-05-07T20:31:59.0856782Z N_H_L=126, 2025-05-07T20:31:59.0856967Z ) 2025-05-07T20:31:59.0857157Z Trying example: test_gqa( 2025-05-07T20:31:59.0857447Z self=, 2025-05-07T20:31:59.0857750Z int4_kv=False, 2025-05-07T20:31:59.0857958Z num_groups=4, 2025-05-07T20:31:59.0858159Z B=105, 2025-05-07T20:31:59.0858339Z MAX_T=105, 2025-05-07T20:31:59.0858533Z N_H_L=105, 2025-05-07T20:31:59.0858723Z ) 2025-05-07T20:31:59.0858909Z Trying example: test_gqa( 2025-05-07T20:31:59.0859187Z self=, 2025-05-07T20:31:59.0859492Z int4_kv=True, 2025-05-07T20:31:59.0859698Z num_groups=1, 2025-05-07T20:31:59.0859894Z B=95, 2025-05-07T20:31:59.0860261Z MAX_T=114, 2025-05-07T20:31:59.0860457Z N_H_L=43, 2025-05-07T20:31:59.0860640Z ) 2025-05-07T20:31:59.0860828Z Trying example: test_gqa( 2025-05-07T20:31:59.0861118Z self=, 2025-05-07T20:31:59.0861417Z int4_kv=True, 2025-05-07T20:31:59.0861625Z num_groups=1, 2025-05-07T20:31:59.0861836Z B=43, 2025-05-07T20:31:59.0862015Z MAX_T=114, 2025-05-07T20:31:59.0862208Z N_H_L=43, 2025-05-07T20:31:59.0862392Z ) 2025-05-07T20:31:59.0862574Z Trying example: test_gqa( 2025-05-07T20:31:59.0862851Z self=, 2025-05-07T20:31:59.0863148Z int4_kv=True, 2025-05-07T20:31:59.0863349Z num_groups=1, 2025-05-07T20:31:59.0863555Z B=43, 2025-05-07T20:31:59.0863741Z MAX_T=43, 2025-05-07T20:31:59.0863925Z N_H_L=43, 2025-05-07T20:31:59.0864116Z ) 2025-05-07T20:31:59.0864306Z Trying example: test_gqa( 2025-05-07T20:31:59.0864587Z self=, 2025-05-07T20:31:59.0864903Z int4_kv=False, 2025-05-07T20:31:59.0865118Z num_groups=1, 2025-05-07T20:31:59.0865321Z B=21, 2025-05-07T20:31:59.0865513Z MAX_T=38, 2025-05-07T20:31:59.0865707Z N_H_L=42, 2025-05-07T20:31:59.0865891Z ) 2025-05-07T20:31:59.0866088Z Trying example: test_gqa( 2025-05-07T20:31:59.0866376Z self=, 2025-05-07T20:31:59.0866687Z int4_kv=False, 2025-05-07T20:31:59.0866894Z num_groups=1, 2025-05-07T20:31:59.0867109Z B=38, 2025-05-07T20:31:59.0867297Z MAX_T=38, 2025-05-07T20:31:59.0867588Z N_H_L=42, 2025-05-07T20:31:59.0867776Z ) 2025-05-07T20:31:59.0867965Z Trying example: test_gqa( 2025-05-07T20:31:59.0868244Z self=, 2025-05-07T20:31:59.0868551Z int4_kv=False, 2025-05-07T20:31:59.0868768Z num_groups=1, 2025-05-07T20:31:59.0868968Z B=38, 2025-05-07T20:31:59.0869148Z MAX_T=42, 2025-05-07T20:31:59.0869340Z N_H_L=42, 2025-05-07T20:31:59.0869520Z ) 2025-05-07T20:31:59.0869715Z Trying example: test_gqa( 2025-05-07T20:31:59.0869998Z self=, 2025-05-07T20:31:59.0870298Z int4_kv=False, 2025-05-07T20:31:59.0870602Z num_groups=1, 2025-05-07T20:31:59.0870807Z B=42, 2025-05-07T20:31:59.0870983Z MAX_T=42, 2025-05-07T20:31:59.0871173Z N_H_L=42, 2025-05-07T20:31:59.0871362Z ) 2025-05-07T20:31:59.0871543Z Trying example: test_gqa( 2025-05-07T20:31:59.0871829Z self=, 2025-05-07T20:31:59.0872128Z int4_kv=True, 2025-05-07T20:31:59.0872332Z num_groups=1, 2025-05-07T20:31:59.0872534Z B=74, 2025-05-07T20:31:59.0872714Z MAX_T=20, 2025-05-07T20:31:59.0872897Z N_H_L=15, 2025-05-07T20:31:59.0873088Z ) 2025-05-07T20:31:59.0873276Z Trying example: test_gqa( 2025-05-07T20:31:59.0873559Z self=, 2025-05-07T20:31:59.0873853Z int4_kv=True, 2025-05-07T20:31:59.0874063Z num_groups=1, 2025-05-07T20:31:59.0874275Z B=20, 2025-05-07T20:31:59.0874458Z MAX_T=20, 
2025-05-07T20:31:59.0874655Z N_H_L=15, 2025-05-07T20:31:59.0874847Z ) 2025-05-07T20:31:59.0875034Z Trying example: test_gqa( 2025-05-07T20:31:59.0875328Z self=, 2025-05-07T20:31:59.0875634Z int4_kv=True, 2025-05-07T20:31:59.0875842Z num_groups=1, 2025-05-07T20:31:59.0876053Z B=20, 2025-05-07T20:31:59.0876246Z MAX_T=15, 2025-05-07T20:31:59.0876434Z N_H_L=15, 2025-05-07T20:31:59.0876629Z ) 2025-05-07T20:31:59.0876823Z Trying example: test_gqa( 2025-05-07T20:31:59.0877104Z self=, 2025-05-07T20:31:59.0877410Z int4_kv=True, 2025-05-07T20:31:59.0877622Z num_groups=1, 2025-05-07T20:31:59.0877825Z B=15, 2025-05-07T20:31:59.0878015Z MAX_T=20, 2025-05-07T20:31:59.0878218Z N_H_L=15, 2025-05-07T20:31:59.0878410Z ) 2025-05-07T20:31:59.0878605Z Trying example: test_gqa( 2025-05-07T20:31:59.0878894Z self=, 2025-05-07T20:31:59.0879290Z int4_kv=True, 2025-05-07T20:31:59.0879504Z num_groups=1, 2025-05-07T20:31:59.0879709Z B=15, 2025-05-07T20:31:59.0879900Z MAX_T=15, 2025-05-07T20:31:59.0880094Z N_H_L=15, 2025-05-07T20:31:59.0880282Z ) 2025-05-07T20:31:59.0880462Z Trying example: test_gqa( 2025-05-07T20:31:59.0880746Z self=, 2025-05-07T20:31:59.0881050Z int4_kv=False, 2025-05-07T20:31:59.0881257Z num_groups=4, 2025-05-07T20:31:59.0881456Z B=117, 2025-05-07T20:31:59.0881642Z MAX_T=104, 2025-05-07T20:31:59.0881836Z N_H_L=69, 2025-05-07T20:31:59.0882018Z ) 2025-05-07T20:31:59.0882207Z Trying example: test_gqa( 2025-05-07T20:31:59.0882495Z self=, 2025-05-07T20:31:59.0882796Z int4_kv=False, 2025-05-07T20:31:59.0883006Z num_groups=4, 2025-05-07T20:31:59.0883211Z B=117, 2025-05-07T20:31:59.0883392Z MAX_T=117, 2025-05-07T20:31:59.0883587Z N_H_L=69, 2025-05-07T20:31:59.0883783Z ) 2025-05-07T20:31:59.0883969Z Trying example: test_gqa( 2025-05-07T20:31:59.0884259Z self=, 2025-05-07T20:31:59.0884575Z int4_kv=False, 2025-05-07T20:31:59.0884775Z num_groups=4, 2025-05-07T20:31:59.0884984Z B=69, 2025-05-07T20:31:59.0885177Z MAX_T=117, 2025-05-07T20:31:59.0885368Z N_H_L=69, 2025-05-07T20:31:59.0885559Z ) 2025-05-07T20:31:59.0885751Z Trying example: test_gqa( 2025-05-07T20:31:59.0886034Z self=, 2025-05-07T20:31:59.0886347Z int4_kv=False, 2025-05-07T20:31:59.0886568Z num_groups=4, 2025-05-07T20:31:59.0886772Z B=117, 2025-05-07T20:31:59.0886971Z MAX_T=69, 2025-05-07T20:31:59.0887171Z N_H_L=69, 2025-05-07T20:31:59.0887356Z ) 2025-05-07T20:31:59.0887541Z PASSED 2025-05-07T20:31:59.0978026Z attention/gqa_test.py::Int4GQATest::test_mqa_main SKIPPED (Skip when...) 
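(Aside: the "Trying example: test_gqa(...)" lines above are emitted by Hypothesis, which drives the test through the parameter combinations shown, using the 'ci' profile reported at session start. Such a profile would typically be registered in a conftest.py along the lines of the sketch below; this is an illustration, not FBGEMM's actual configuration code.)

    from hypothesis import HealthCheck, settings

    # Register and activate a "ci" profile with the settings seen in the log.
    settings.register_profile(
        "ci",
        database=None,                                  # do not persist failing examples
        deadline=None,                                  # no per-example time limit
        print_blob=True,                                # print a reproduction blob on failure
        derandomize=True,                               # deterministic example generation
        suppress_health_check=(HealthCheck.too_slow,),  # tolerate slow GPU examples
    )
    settings.load_profile("ci")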
2025-05-07T20:31:59.0978345Z 2025-05-07T20:31:59.0978495Z =========================== short test summary info ============================ 2025-05-07T20:31:59.0979363Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/unittest/case.py:154: Skip when CUDA is not available or xformers is not available 2025-05-07T20:31:59.0980052Z ======================== 1 passed, 1 skipped in 39.70s ========================= 2025-05-07T20:31:59.7762105Z 2025-05-07T20:31:59.7762620Z [TEST] Python test suite PASSED: ./attention/gqa_test.py 2025-05-07T20:31:59.7782578Z [TEST] Python test time for ./attention/gqa_test.py: 42 seconds 2025-05-07T20:31:59.7782868Z 2025-05-07T20:31:59.7782872Z 2025-05-07T20:31:59.7782876Z 2025-05-07T20:31:59.7782880Z 2025-05-07T20:31:59.7803072Z ################################################################################ 2025-05-07T20:31:59.7821220Z # [2025-05-07T20:31:59.781Z] Run Python Test Suite: 2025-05-07T20:31:59.7821542Z # ./coalesce/coalesce_test.py 2025-05-07T20:31:59.7821816Z ################################################################################ 2025-05-07T20:31:59.7846101Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./coalesce/coalesce_test.py 2025-05-07T20:31:59.7846749Z 2025-05-07T20:32:01.9387073Z ============================= test session starts ============================== 2025-05-07T20:32:01.9388353Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:01.9389375Z cachedir: .pytest_cache 2025-05-07T20:32:01.9390485Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:01.9391887Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:01.9392670Z plugins: hypothesis-6.131.14 2025-05-07T20:32:03.5096172Z collecting ... 
collected 1 item 2025-05-07T20:32:03.5096390Z 2025-05-07T20:32:04.2624502Z coalesce/coalesce_test.py::CoalesceTest::test_coalesce_batches PASSED 2025-05-07T20:32:04.2625121Z 2025-05-07T20:32:04.2625265Z ============================== 1 passed in 2.46s =============================== 2025-05-07T20:32:04.9274943Z 2025-05-07T20:32:04.9275661Z [TEST] Python test suite PASSED: ./coalesce/coalesce_test.py 2025-05-07T20:32:04.9292464Z [TEST] Python test time for ./coalesce/coalesce_test.py: 5 seconds 2025-05-07T20:32:04.9292771Z 2025-05-07T20:32:04.9292776Z 2025-05-07T20:32:04.9292792Z 2025-05-07T20:32:04.9292796Z 2025-05-07T20:32:04.9312804Z ################################################################################ 2025-05-07T20:32:04.9327976Z # [2025-05-07T20:32:04.932Z] Run Python Test Suite: 2025-05-07T20:32:04.9328298Z # ./comm/multi_gpu_car_test.py 2025-05-07T20:32:04.9328586Z ################################################################################ 2025-05-07T20:32:04.9353291Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./comm/multi_gpu_car_test.py 2025-05-07T20:32:04.9353934Z 2025-05-07T20:32:07.0900231Z ============================= test session starts ============================== 2025-05-07T20:32:07.0900903Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:07.0901442Z cachedir: .pytest_cache 2025-05-07T20:32:07.0901997Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:07.0902811Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:07.0903216Z plugins: hypothesis-6.131.14 2025-05-07T20:32:08.7064014Z collecting ... 
collected 5 items 2025-05-07T20:32:08.7064441Z 2025-05-07T20:32:08.7074109Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather SKIPPED 2025-05-07T20:32:08.7081798Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allgather_dtype_mismatch SKIPPED 2025-05-07T20:32:08.7088391Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_allreduce SKIPPED 2025-05-07T20:32:08.7099324Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_oneshot_car_stress SKIPPED 2025-05-07T20:32:08.7113738Z comm/multi_gpu_car_test.py::LLamaMultiGpuTests::test_reducescatter SKIPPED 2025-05-07T20:32:08.7114074Z 2025-05-07T20:32:08.7114223Z =========================== short test summary info ============================ 2025-05-07T20:32:08.7114893Z SKIPPED [1] comm/multi_gpu_car_test.py:310: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:32:08.7115807Z SKIPPED [1] comm/multi_gpu_car_test.py:351: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:32:08.7116706Z SKIPPED [1] comm/multi_gpu_car_test.py:418: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:32:08.7117609Z SKIPPED [1] comm/multi_gpu_car_test.py:434: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:32:08.7118519Z SKIPPED [1] comm/multi_gpu_car_test.py:402: Skip when CUDA is not available or when there are not enough GPUs; these tests require at least two GPUs 2025-05-07T20:32:08.7119153Z ============================== 5 skipped in 1.75s ============================== 2025-05-07T20:32:09.3044034Z 2025-05-07T20:32:09.3044488Z [TEST] Python test suite PASSED: ./comm/multi_gpu_car_test.py 2025-05-07T20:32:09.3063389Z [TEST] Python test time for ./comm/multi_gpu_car_test.py: 5 seconds 2025-05-07T20:32:09.3063674Z 2025-05-07T20:32:09.3063679Z 2025-05-07T20:32:09.3063684Z 2025-05-07T20:32:09.3063687Z 2025-05-07T20:32:09.3084975Z ################################################################################ 2025-05-07T20:32:09.3102331Z # [2025-05-07T20:32:09.309Z] Run Python Test Suite: 2025-05-07T20:32:09.3102668Z # ./gather_scatter/gather_scatter_test.py 2025-05-07T20:32:09.3103527Z ################################################################################ 2025-05-07T20:32:09.3127628Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./gather_scatter/gather_scatter_test.py 2025-05-07T20:32:09.3128533Z 2025-05-07T20:32:11.4666614Z ============================= test session starts ============================== 2025-05-07T20:32:11.4667967Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:11.4669004Z cachedir: .pytest_cache 2025-05-07T20:32:11.4669837Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:11.4670584Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:11.4670982Z plugins: hypothesis-6.131.14 2025-05-07T20:32:13.1249571Z collecting ... 
collected 2 items 2025-05-07T20:32:13.1249799Z 2025-05-07T20:32:13.1259609Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_gather_along_first_dim SKIPPED 2025-05-07T20:32:13.1274160Z gather_scatter/gather_scatter_test.py::GatherScatterTests::test_scatter_add_along_first_dim SKIPPED 2025-05-07T20:32:13.1274609Z 2025-05-07T20:32:13.1274757Z =========================== short test summary info ============================ 2025-05-07T20:32:13.1275396Z SKIPPED [1] gather_scatter/gather_scatter_test.py:29: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:32:13.1276231Z SKIPPED [1] gather_scatter/gather_scatter_test.py:70: Skip when no Hopper GPU is available. This test is only for Hopper GPU. 2025-05-07T20:32:13.1276817Z ============================== 2 skipped in 1.79s ============================== 2025-05-07T20:32:13.7408108Z 2025-05-07T20:32:13.7409019Z [TEST] Python test suite PASSED: ./gather_scatter/gather_scatter_test.py 2025-05-07T20:32:13.7428842Z [TEST] Python test time for ./gather_scatter/gather_scatter_test.py: 4 seconds 2025-05-07T20:32:13.7429198Z 2025-05-07T20:32:13.7429202Z 2025-05-07T20:32:13.7429206Z 2025-05-07T20:32:13.7429581Z 2025-05-07T20:32:13.7451457Z ################################################################################ 2025-05-07T20:32:13.7466943Z # [2025-05-07T20:32:13.746Z] Run Python Test Suite: 2025-05-07T20:32:13.7467271Z # ./kv_cache/kv_cache_test.py 2025-05-07T20:32:13.7467653Z ################################################################################ 2025-05-07T20:32:13.7492455Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./kv_cache/kv_cache_test.py 2025-05-07T20:32:13.7493305Z 2025-05-07T20:32:15.9139037Z ============================= test session starts ============================== 2025-05-07T20:32:15.9139671Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:15.9140581Z cachedir: .pytest_cache 2025-05-07T20:32:15.9141236Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:15.9141986Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:15.9142385Z plugins: hypothesis-6.131.14 2025-05-07T20:32:17.5197629Z collecting ... collected 4 items 2025-05-07T20:32:17.5197932Z 2025-05-07T20:32:20.0071914Z kv_cache/kv_cache_test.py::KVCacheTests::test_fp8_kv_cache SKIPPED (...) 
2025-05-07T20:32:20.0153068Z kv_cache/kv_cache_test.py::KVCacheTests::test_int4_kv_cache SKIPPED 2025-05-07T20:32:20.0243794Z kv_cache/kv_cache_test.py::KVCacheTests::test_positional_encoding_with_paged_attention SKIPPED 2025-05-07T20:32:20.0328558Z kv_cache/kv_cache_test.py::KVCacheTests::test_rope_positional_encoding_only SKIPPED 2025-05-07T20:32:20.0329050Z 2025-05-07T20:32:20.0329254Z =========================== short test summary info ============================ 2025-05-07T20:32:20.0330398Z SKIPPED [1] ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/unittest/case.py:154: Skip when H100 is not available or MI300 is not available 2025-05-07T20:32:20.0331332Z SKIPPED [3] ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/unittest/case.py:154: Skip when xformers is not available 2025-05-07T20:32:20.0331946Z ============================== 4 skipped in 4.25s ============================== 2025-05-07T20:32:22.1256559Z 2025-05-07T20:32:22.1257372Z [TEST] Python test suite PASSED: ./kv_cache/kv_cache_test.py 2025-05-07T20:32:22.1276326Z [TEST] Python test time for ./kv_cache/kv_cache_test.py: 9 seconds 2025-05-07T20:32:22.1276610Z 2025-05-07T20:32:22.1276615Z 2025-05-07T20:32:22.1276619Z 2025-05-07T20:32:22.1276622Z 2025-05-07T20:32:22.1296965Z ################################################################################ 2025-05-07T20:32:22.1311953Z # [2025-05-07T20:32:22.130Z] Run Python Test Suite: 2025-05-07T20:32:22.1312407Z # ./moe/activation_test.py 2025-05-07T20:32:22.1312805Z ################################################################################ 2025-05-07T20:32:22.1338364Z + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py 2025-05-07T20:32:22.1338989Z 2025-05-07T20:32:24.2955570Z ============================= test session starts ============================== 2025-05-07T20:32:24.2956530Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:24.2957417Z cachedir: .pytest_cache 2025-05-07T20:32:24.2958394Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:24.2959565Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:24.2960240Z plugins: hypothesis-6.131.14 2025-05-07T20:32:25.8988336Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:25.9950017Z collecting ... 
collected 2 items 2025-05-07T20:32:25.9950220Z 2025-05-07T20:32:31.0647525Z moe/activation_test.py::ActivationTests::test_silu_mul Trying example: test_silu_mul( 2025-05-07T20:32:31.0648148Z self=, 2025-05-07T20:32:31.0648529Z T=1, 2025-05-07T20:32:31.0648719Z D=5120, 2025-05-07T20:32:31.0648914Z contiguous=True, 2025-05-07T20:32:31.0649131Z compiled=True, 2025-05-07T20:32:31.0649338Z ) 2025-05-07T20:32:31.0649532Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0649897Z self=, 2025-05-07T20:32:31.0650276Z T=4096, 2025-05-07T20:32:31.0650465Z D=5120, 2025-05-07T20:32:31.0650648Z contiguous=True, 2025-05-07T20:32:31.0650867Z compiled=True, 2025-05-07T20:32:31.0651067Z ) 2025-05-07T20:32:31.0651261Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0651641Z self=, 2025-05-07T20:32:31.0652018Z T=4096, 2025-05-07T20:32:31.0652199Z D=7168, 2025-05-07T20:32:31.0652393Z contiguous=False, 2025-05-07T20:32:31.0652613Z compiled=False, 2025-05-07T20:32:31.0652809Z ) 2025-05-07T20:32:31.0653002Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0653369Z self=, 2025-05-07T20:32:31.0653737Z T=4096, 2025-05-07T20:32:31.0655688Z D=5120, 2025-05-07T20:32:31.0655891Z contiguous=False, 2025-05-07T20:32:31.0656113Z compiled=True, 2025-05-07T20:32:31.0656306Z ) 2025-05-07T20:32:31.0656499Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0656869Z self=, 2025-05-07T20:32:31.0657248Z T=1, 2025-05-07T20:32:31.0657427Z D=7168, 2025-05-07T20:32:31.0657615Z contiguous=True, 2025-05-07T20:32:31.0658011Z compiled=True, 2025-05-07T20:32:31.0658208Z ) 2025-05-07T20:32:31.0658399Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0658766Z self=, 2025-05-07T20:32:31.0659132Z T=1, 2025-05-07T20:32:31.0659310Z D=7168, 2025-05-07T20:32:31.0659492Z contiguous=False, 2025-05-07T20:32:31.0659713Z compiled=True, 2025-05-07T20:32:31.0659909Z ) 2025-05-07T20:32:31.0660096Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0660454Z self=, 2025-05-07T20:32:31.0660822Z T=4096, 2025-05-07T20:32:31.0660999Z D=5120, 2025-05-07T20:32:31.0661183Z contiguous=False, 2025-05-07T20:32:31.0661401Z compiled=False, 2025-05-07T20:32:31.0661599Z ) 2025-05-07T20:32:31.0661781Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0662146Z self=, 2025-05-07T20:32:31.0662511Z T=1, 2025-05-07T20:32:31.0662689Z D=7168, 2025-05-07T20:32:31.0662871Z contiguous=True, 2025-05-07T20:32:31.0663090Z compiled=False, 2025-05-07T20:32:31.0663282Z ) 2025-05-07T20:32:31.0663475Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0663837Z self=, 2025-05-07T20:32:31.0664204Z T=2048, 2025-05-07T20:32:31.0664383Z D=5120, 2025-05-07T20:32:31.0664569Z contiguous=True, 2025-05-07T20:32:31.0664780Z compiled=True, 2025-05-07T20:32:31.0664978Z ) 2025-05-07T20:32:31.0665165Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0665523Z self=, 2025-05-07T20:32:31.0665887Z T=2048, 2025-05-07T20:32:31.0666067Z D=7168, 2025-05-07T20:32:31.0666249Z contiguous=True, 2025-05-07T20:32:31.0666459Z compiled=True, 2025-05-07T20:32:31.0666656Z ) 2025-05-07T20:32:31.0666842Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0667197Z self=, 2025-05-07T20:32:31.0667657Z T=2048, 2025-05-07T20:32:31.0667841Z D=7168, 2025-05-07T20:32:31.0668021Z contiguous=True, 2025-05-07T20:32:31.0668743Z compiled=False, 2025-05-07T20:32:31.0668963Z ) 2025-05-07T20:32:31.0669146Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0669512Z self=, 2025-05-07T20:32:31.0669892Z T=128, 2025-05-07T20:32:31.0670068Z D=5120, 2025-05-07T20:32:31.0670260Z contiguous=False, 2025-05-07T20:32:31.0670615Z 
compiled=True, 2025-05-07T20:32:31.0670940Z ) 2025-05-07T20:32:31.0671215Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0671666Z self=, 2025-05-07T20:32:31.0680203Z T=128, 2025-05-07T20:32:31.0680404Z D=5120, 2025-05-07T20:32:31.0680589Z contiguous=True, 2025-05-07T20:32:31.0680815Z compiled=True, 2025-05-07T20:32:31.0681020Z ) 2025-05-07T20:32:31.0681218Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0681593Z self=, 2025-05-07T20:32:31.0681980Z T=16384, 2025-05-07T20:32:31.0682177Z D=5120, 2025-05-07T20:32:31.0682365Z contiguous=False, 2025-05-07T20:32:31.0682584Z compiled=True, 2025-05-07T20:32:31.0682786Z ) 2025-05-07T20:32:31.0682972Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0683345Z self=, 2025-05-07T20:32:31.0683715Z T=16384, 2025-05-07T20:32:31.0683901Z D=5120, 2025-05-07T20:32:31.0684092Z contiguous=False, 2025-05-07T20:32:31.0684315Z compiled=False, 2025-05-07T20:32:31.0684513Z ) 2025-05-07T20:32:31.0684708Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0685071Z self=, 2025-05-07T20:32:31.0685428Z T=128, 2025-05-07T20:32:31.0685612Z D=7168, 2025-05-07T20:32:31.0685798Z contiguous=True, 2025-05-07T20:32:31.0686121Z compiled=False, 2025-05-07T20:32:31.0686320Z ) 2025-05-07T20:32:31.0686504Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0686863Z self=, 2025-05-07T20:32:31.0687225Z T=128, 2025-05-07T20:32:31.0687407Z D=7168, 2025-05-07T20:32:31.0687590Z contiguous=False, 2025-05-07T20:32:31.0687812Z compiled=False, 2025-05-07T20:32:31.0688014Z ) 2025-05-07T20:32:31.0688192Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0688554Z self=, 2025-05-07T20:32:31.0688921Z T=1, 2025-05-07T20:32:31.0689094Z D=5120, 2025-05-07T20:32:31.0689275Z contiguous=False, 2025-05-07T20:32:31.0689498Z compiled=False, 2025-05-07T20:32:31.0689697Z ) 2025-05-07T20:32:31.0689878Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0690244Z self=, 2025-05-07T20:32:31.0690609Z T=1, 2025-05-07T20:32:31.0690787Z D=7168, 2025-05-07T20:32:31.0690978Z contiguous=False, 2025-05-07T20:32:31.0691196Z compiled=False, 2025-05-07T20:32:31.0691388Z ) 2025-05-07T20:32:31.0691577Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0691936Z self=, 2025-05-07T20:32:31.0692296Z T=4096, 2025-05-07T20:32:31.0692478Z D=5120, 2025-05-07T20:32:31.0692666Z contiguous=True, 2025-05-07T20:32:31.0692877Z compiled=False, 2025-05-07T20:32:31.0693075Z ) 2025-05-07T20:32:31.0693264Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0693622Z self=, 2025-05-07T20:32:31.0693990Z T=128, 2025-05-07T20:32:31.0694170Z D=7168, 2025-05-07T20:32:31.0694356Z contiguous=True, 2025-05-07T20:32:31.0694566Z compiled=True, 2025-05-07T20:32:31.0694763Z ) 2025-05-07T20:32:31.0694949Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0695304Z self=, 2025-05-07T20:32:31.0695676Z T=1, 2025-05-07T20:32:31.0695857Z D=5120, 2025-05-07T20:32:31.0696035Z contiguous=False, 2025-05-07T20:32:31.0696357Z compiled=True, 2025-05-07T20:32:31.0696555Z ) 2025-05-07T20:32:31.0696737Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0697091Z self=, 2025-05-07T20:32:31.0697466Z T=4096, 2025-05-07T20:32:31.0697641Z D=7168, 2025-05-07T20:32:31.0697823Z contiguous=True, 2025-05-07T20:32:31.0698037Z compiled=False, 2025-05-07T20:32:31.0698230Z ) 2025-05-07T20:32:31.0698416Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0698776Z self=, 2025-05-07T20:32:31.0699134Z T=4096, 2025-05-07T20:32:31.0699312Z D=7168, 2025-05-07T20:32:31.0699495Z contiguous=False, 2025-05-07T20:32:31.0699707Z compiled=True, 2025-05-07T20:32:31.0699911Z ) 
2025-05-07T20:32:31.0700106Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0700469Z self=, 2025-05-07T20:32:31.0700841Z T=128, 2025-05-07T20:32:31.0701021Z D=5120, 2025-05-07T20:32:31.0701208Z contiguous=True, 2025-05-07T20:32:31.0701416Z compiled=False, 2025-05-07T20:32:31.0701615Z ) 2025-05-07T20:32:31.0701802Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0702153Z self=, 2025-05-07T20:32:31.0702516Z T=128, 2025-05-07T20:32:31.0702691Z D=5120, 2025-05-07T20:32:31.0702873Z contiguous=False, 2025-05-07T20:32:31.0703093Z compiled=False, 2025-05-07T20:32:31.0703287Z ) 2025-05-07T20:32:31.0703477Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0703836Z self=, 2025-05-07T20:32:31.0704206Z T=1, 2025-05-07T20:32:31.0704379Z D=5120, 2025-05-07T20:32:31.0704563Z contiguous=True, 2025-05-07T20:32:31.0704869Z compiled=False, 2025-05-07T20:32:31.0705062Z ) 2025-05-07T20:32:31.0705256Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0705624Z self=, 2025-05-07T20:32:31.0705983Z T=2048, 2025-05-07T20:32:31.0706162Z D=7168, 2025-05-07T20:32:31.0706349Z contiguous=False, 2025-05-07T20:32:31.0706570Z compiled=True, 2025-05-07T20:32:31.0706765Z ) 2025-05-07T20:32:31.0706955Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0707318Z self=, 2025-05-07T20:32:31.0707760Z T=2048, 2025-05-07T20:32:31.0707941Z D=7168, 2025-05-07T20:32:31.0708124Z contiguous=False, 2025-05-07T20:32:31.0708339Z compiled=False, 2025-05-07T20:32:31.0708537Z ) 2025-05-07T20:32:31.0708723Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0709078Z self=, 2025-05-07T20:32:31.0709448Z T=16384, 2025-05-07T20:32:31.0709640Z D=7168, 2025-05-07T20:32:31.0709819Z contiguous=False, 2025-05-07T20:32:31.0710038Z compiled=True, 2025-05-07T20:32:31.0710237Z ) 2025-05-07T20:32:31.0710425Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0710785Z self=, 2025-05-07T20:32:31.0711150Z T=16384, 2025-05-07T20:32:31.0711335Z D=7168, 2025-05-07T20:32:31.0711522Z contiguous=True, 2025-05-07T20:32:31.0711740Z compiled=True, 2025-05-07T20:32:31.0711930Z ) 2025-05-07T20:32:31.0712124Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0712483Z self=, 2025-05-07T20:32:31.0712849Z T=4096, 2025-05-07T20:32:31.0713019Z D=7168, 2025-05-07T20:32:31.0713204Z contiguous=True, 2025-05-07T20:32:31.0713419Z compiled=True, 2025-05-07T20:32:31.0713608Z ) 2025-05-07T20:32:31.0713793Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0714151Z self=, 2025-05-07T20:32:31.0714511Z T=2048, 2025-05-07T20:32:31.0714692Z D=5120, 2025-05-07T20:32:31.0714877Z contiguous=False, 2025-05-07T20:32:31.0715178Z compiled=False, 2025-05-07T20:32:31.0715386Z ) 2025-05-07T20:32:31.0715576Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0715930Z self=, 2025-05-07T20:32:31.0716297Z T=2048, 2025-05-07T20:32:31.0716479Z D=5120, 2025-05-07T20:32:31.0716658Z contiguous=True, 2025-05-07T20:32:31.0716874Z compiled=False, 2025-05-07T20:32:31.0717074Z ) 2025-05-07T20:32:31.0717256Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0717621Z self=, 2025-05-07T20:32:31.0717989Z T=128, 2025-05-07T20:32:31.0718171Z D=7168, 2025-05-07T20:32:31.0718354Z contiguous=False, 2025-05-07T20:32:31.0718574Z compiled=True, 2025-05-07T20:32:31.0718768Z ) 2025-05-07T20:32:31.0718954Z Trying example: test_silu_mul( 2025-05-07T20:32:31.0719313Z self=, 2025-05-07T20:32:31.0719677Z T=16384, 2025-05-07T20:32:31.0719860Z D=5120, 2025-05-07T20:32:31.0720050Z contiguous=True, 2025-05-07T20:32:31.0720263Z compiled=True, 2025-05-07T20:32:31.0720455Z ) 2025-05-07T20:32:31.0720645Z Trying example: 
2025-05-07T20:32:31.0722327Z Trying example: test_silu_mul(self=, T=16384, D=5120, contiguous=True, compiled=False)
2025-05-07T20:32:31.0724101Z Trying example: test_silu_mul(self=, T=16384, D=7168, contiguous=False, compiled=False)
2025-05-07T20:32:31.0725883Z Trying example: test_silu_mul(self=, T=16384, D=7168, contiguous=True, compiled=False)
2025-05-07T20:32:31.0727561Z PASSED
2025-05-07T20:32:31.1329204Z W0507 20:32:31.130000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
Traceback (most recent call last):
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
    ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
    ttir_module = src.make_ir(options, codegen_fns, module_map, context)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir
    return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, module_map=module_map)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
    generator.visit(fn.parse())
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit
    ret = super().visit(node)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit
    return visitor(node)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
    ast.NodeVisitor.generic_visit(self, node)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit
    self.visit(item)
  File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit
    raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
triton.compiler.errors.CompilationError: at 1:0:
def _fbgemm_silu_mul_quant(
^
ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:31.1483142Z W0507 20:32:31.146000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
[... identical CompilationError traceback omitted; the same warning repeats at 20:32:31.185000 and 20:32:31.190000 ...]
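The repeated ValueError is an architecture mismatch, not a kernel bug: Triton's fp8e4nv type corresponds to e4m3, which its NVIDIA backend only accepts on newer GPUs, while the A10G on this g5.4xlarge runner reports compute capability (8, 6). A minimal sketch of the kind of capability guard that avoids compiling such kernels on unsupported parts; the helper name and the 8.9 threshold are assumptions for illustration, not code from this repo:

    import torch

    def fp8_e4m3_supported() -> bool:
        # Assumed threshold: Triton's fp8e4nv (e4m3) path targets sm_89+
        # (Ada) and sm_90 (Hopper); the A10G here is sm_86.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)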
2025-05-07T20:32:31.5990021Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
self =
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13ea25d4e0>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
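Here even the reference path dies, because triton_quantize_fp8_row is itself a Triton kernel that emits fp8e4nv. For orientation, a rough pure-PyTorch sketch of the rowwise fp8 quantization such a reference computes; the semantics (per-row max, the 448 e4m3 ceiling, scale_ub as an upper clamp) are assumptions inferred from the test, not FBGEMM's actual kernel:

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row absolute max, optionally clamped from above by scale_ub.
        row_max = y.abs().amax(dim=-1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        # Scale such that y ~= y_fp8.to(float32) * scale[:, None],
        # matching the dequantization in the test body above.
        scale = torch.clamp(row_max, min=1e-12) / FP8_MAX
        y_fp8 = (y.float() / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale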
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

2025-05-07T20:32:31.6037940Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
self =
T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False

    [... test body identical to the listing above ...]

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13ea28e160>}
module_map = {'triton.language.extra.libdevice': }
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture.
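Both paths, the _fbgemm_silu_mul_quant kernel under test and the _kernel_quantize_fp8_row reference, fail at Triton compile time for the same environmental reason, so the failures say nothing about the kernels' logic. A hedged sketch of how such hypothesis tests are commonly gated on hardware support; the skipIf placement and the helper reuse the assumed guard from the earlier sketch and are not this repo's actual mechanism:

    import unittest

    import torch

    def fp8_e4m3_supported() -> bool:
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    class ActivationTests(unittest.TestCase):
        @unittest.skipIf(
            not fp8_e4m3_supported(),
            "fp8e4nv (e4m3) kernels require sm_89+; skipping on this GPU",
        )
        def test_silu_mul_quant(self) -> None:
            ...  # body as in the log above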
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:31.8665898Z W0507 20:32:31.862000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
[... identical CompilationError traceback omitted; the same warning repeats at 20:32:31.933000, 20:32:32.141000, and 20:32:32.151000 ...]
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:31.9367235Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:31.9368578Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Traceback (most recent call last): 2025-05-07T20:32:31.9369893Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:31.9371393Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:31.9372355Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:31.9373644Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:31.9375004Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:31.9376297Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:31.9377644Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:31.9378682Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] module_map=module_map) 2025-05-07T20:32:31.9380064Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:31.9381290Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:32:31.9382114Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:31.9383296Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:31.9384483Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:32:31.9385510Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:31.9386510Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 
2025-05-07T20:32:31.9387792Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:31.9389053Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:31.9389941Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:31.9391098Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:31.9392117Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] self.visit(item) 2025-05-07T20:32:31.9392865Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:31.9394013Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:31.9395338Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:31.9396388Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:31.9397274Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:31.9398009Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^ 2025-05-07T20:32:31.9399005Z W0507 20:32:31.933000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:32.1453450Z W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:32.1454506Z W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] Traceback (most recent call last): 2025-05-07T20:32:32.1456013Z W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:32.1457413Z W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:32.1458377Z W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:32.1459658Z W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:32.1461027Z W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:32.1462313Z W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:32.1463652Z W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:32.1464675Z W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] module_map=module_map) 2025-05-07T20:32:32.1465915Z W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:32.1467292Z W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] generator.visit(fn.parse()) 2025-05-07T20:32:32.1468172Z W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:32.1469359Z W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:32.1470543Z W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ret = super().visit(node) 2025-05-07T20:32:32.1471552Z W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:32.1472572Z W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] return visitor(node) 
W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1]     ast.NodeVisitor.generic_visit(self, node)
W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1]     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit
W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1]     self.visit(item)
W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1]     ~~~~~~~~~~^^^^^^
W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit
W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] triton.compiler.errors.CompilationError: at 1:0:
W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] def _fbgemm_silu_mul_quant(
W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ^
W0507 20:32:32.141000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/1] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
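
Every repetition of this warning bottoms out in the same root cause: Triton refuses to emit the fp8e4nv (float8_e4m3fn) type on this GPU. The job runs on a g5.4xlarge runner, whose A10G is compute capability 8.6, while fp8e4nv conversions in this Triton version require capability 8.9 or newer (Ada/Hopper). A minimal sketch of the capability probe that accounts for the failure follows; the helper name is ours, not part of the test suite:

    # Hypothetical capability probe (not in the logged suite): fp8e4nv needs
    # SM 8.9+ on NVIDIA GPUs, and the A10G on this runner is SM 8.6.
    import torch

    def supports_fp8e4nv() -> bool:
        if not torch.cuda.is_available():
            return False
        major, minor = torch.cuda.get_device_capability()
        return (major, minor) >= (8, 9)

    print(supports_fp8e4nv())  # False on an A10G (SM 8.6); True on L4 (8.9) or H100 (9.0)
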

Trying example: test_silu_mul_quant(
    self=,
    T=2048,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = 
T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13ea1be660>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
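
Note that the reference path fails the same way as the kernel under test: triton_quantize_fp8_row itself launches a Triton kernel that casts to fp8e4nv, so on this GPU there is no FP8 path to fall back to. A hedged sketch, assuming a stock unittest setup, of how such cases could be skipped on pre-SM 8.9 hardware instead of hard-failing (the decorator placement and helper name are our assumptions, not the suite's actual code):

    # Sketch only: gate FP8 Triton tests on device capability so older GPUs
    # report a skip rather than a CompilationError.
    import unittest
    import torch

    def _fp8e4nv_supported() -> bool:
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    class ActivationTest(unittest.TestCase):
        @unittest.skipUnless(_fp8e4nv_supported(), "fp8e4nv requires SM 8.9+ (Ada/Hopper)")
        def test_silu_mul_quant(self) -> None:
            ...  # body as listed above
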

Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self = 
T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13ea2a1080>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] Traceback (most recent call last):
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]                                         ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]                        module_map=module_map)
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     generator.visit(fn.parse())
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     ~~~~~~~~~~~~~~~^^^^^^^^^^^^
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     ret = super().visit(node)
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     return visitor(node)
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     ast.NodeVisitor.generic_visit(self, node)
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     self.visit(item)
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     ~~~~~~~~~~^^^^^^
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] triton.compiler.errors.CompilationError: at 1:0:
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] def _fbgemm_silu_mul_quant(
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ^
W0507 20:32:32.640000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/2] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
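
The [1/1], [1/2], [1/3] markers index successive torch.compile attempts on the same frame; each attempt re-runs identify_mutated_tensors, which compiles the kernel to TTIR purely for mutation analysis and, when that fails, conservatively assumes every input is mutated (correct, but it blocks some optimizations). The underlying Triton error reproduces without FBGEMM at all; a standalone sketch (ours, under the same SM 8.6 assumption) that raises the same CompilationError:

    # Standalone repro sketch: casting to tl.float8e4nv fails to compile on
    # pre-SM 8.9 GPUs with the same "type fp8e4nv not supported" ValueError.
    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def _cast_fp8(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n
        x = tl.load(x_ptr + offs, mask=mask)
        tl.store(y_ptr + offs, x.to(tl.float8e4nv), mask=mask)

    x = torch.randn(1024, device="cuda", dtype=torch.float32)
    y = torch.empty_like(x, dtype=torch.float8_e4m3fn)
    _cast_fp8[(4,)](x, y, x.numel(), BLOCK=256)  # CompilationError on an A10G
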

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = 
T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13ea2a3740>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
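
The reference path recomputes SiLU-mul in fp32 and then row-quantizes with triton_quantize_fp8_row, so it needs the same fp8e4nv support as the fused kernel. A Triton-free stand-in for the row-wise quantization, sketched to match the dequantization the test applies (y_fp8.to(torch.float32) * y_scale[:, None]); the name, eps, and clamping choices are our assumptions, not fp8_gemm.py's actual logic:

    # Sketch of a pure-PyTorch row-wise FP8 quantizer (a stand-in for
    # triton_quantize_fp8_row, not its actual implementation).
    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row_ref(x, scale_ub=None):
        row_max = x.abs().amax(dim=-1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.to(row_max.dtype))
        scale = row_max.clamp(min=1e-12) / FP8_MAX
        x_fp8 = (x.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        return x_fp8, scale

Unlike the Triton kernel, the final cast to torch.float8_e4m3fn here is a plain data conversion, so it runs on any device PyTorch supports.
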

Trying example: test_silu_mul_quant(
    self=,
    T=4096,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = 
T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13e0527100>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
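
For reference, the computation the fused _fbgemm_silu_mul_quant kernel performs is exactly what the test's ref_fn spells out: SiLU(x0) * x1 in fp32, followed by row-wise FP8 quantization. An eager-mode sketch, reusing the quantize_fp8_row_ref stand-in above:

    # Eager-mode sketch of the fused op (ours): SiLU gating in fp32, then
    # row-wise FP8 quantization via the stand-in sketched earlier.
    import torch

    def silu_mul_quant_ref(x0, x1, scale_ub=None):
        x0f = x0.to(torch.float32)
        y = x0f * torch.sigmoid(x0f) * x1.to(torch.float32)
        return quantize_fp8_row_ref(y, scale_ub)
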
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:33.5624049Z 2025-05-07T20:32:33.5632896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:33.8325021Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:33.8326211Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] Traceback (most recent call last): 2025-05-07T20:32:33.8327544Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:33.8328937Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:33.8329909Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:33.8331220Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:33.8332789Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:33.8334125Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:33.8335527Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:33.8336570Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] module_map=module_map) 2025-05-07T20:32:33.8337819Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:33.8339045Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] generator.visit(fn.parse()) 2025-05-07T20:32:33.8339873Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:33.8341240Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:33.8342427Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ret = super().visit(node) 2025-05-07T20:32:33.8343442Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:33.8344576Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] return visitor(node) 2025-05-07T20:32:33.8345771Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:33.8347024Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:33.8347955Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:33.8349016Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:33.8350040Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] self.visit(item) 2025-05-07T20:32:33.8350788Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:33.8351951Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:33.8353277Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:33.8354301Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:33.8355311Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:33.8356033Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ^ 2025-05-07T20:32:33.8357026Z W0507 20:32:33.828000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/3] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
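The repeated CompilationError above has a single root cause: Triton only lowers the fp8e4nv (FP8 E4M3) dtype on NVIDIA GPUs with compute capability 8.9 or newer, and the A10G in this linux.g5.4xlarge runner is SM 8.6, where only fp8e4b15 and fp8e5 are available, which is exactly what the ValueError reports. A minimal guard such FP8 tests could carry is sketched below; supports_fp8_e4m3 and requires_fp8_e4m3 are hypothetical helper names, not FBGEMM or Triton APIs:

    # Sketch of a capability guard, assuming torch is the only dependency.
    # supports_fp8_e4m3 / requires_fp8_e4m3 are hypothetical helper names.
    import unittest

    import torch

    def supports_fp8_e4m3() -> bool:
        # Triton lowers fp8e4nv only on SM 8.9 (Ada) and SM 9.0 (Hopper) or
        # newer; the A10G on this runner reports (8, 6).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    requires_fp8_e4m3 = unittest.skipUnless(
        supports_fp8_e4m3(), "Triton fp8e4nv needs compute capability >= 8.9"
    )

Applied as a decorator on a test like test_silu_mul_quant, this would turn the failures below into skips on pre-Ada GPUs.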
2025-05-07T20:32:35.4399659Z 
2025-05-07T20:32:35.4400587Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:35.4401749Z self=,
2025-05-07T20:32:35.4402168Z T=4096,
2025-05-07T20:32:35.4402358Z D=7168,
2025-05-07T20:32:35.4402557Z scale_ub=None,
2025-05-07T20:32:35.4402769Z contiguous=False,
2025-05-07T20:32:35.4402996Z compiled=False,
2025-05-07T20:32:35.4403207Z )
2025-05-07T20:32:35.4403517Z self = 
2025-05-07T20:32:35.4404004Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False
2025-05-07T20:32:35.4404287Z 
2025-05-07T20:32:35.4404365Z @given(
2025-05-07T20:32:35.4404600Z T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:35.4404913Z D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:35.4405227Z scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:35.4405612Z contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:35.4405936Z compiled=st.sampled_from([True, False]),
2025-05-07T20:32:35.4406231Z )
2025-05-07T20:32:35.4406588Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:35.4407039Z def test_silu_mul_quant(
2025-05-07T20:32:35.4407290Z self,
2025-05-07T20:32:35.4407483Z T: int,
2025-05-07T20:32:35.4407678Z D: int,
2025-05-07T20:32:35.4407904Z scale_ub: Optional[float],
2025-05-07T20:32:35.4408173Z contiguous: bool,
2025-05-07T20:32:35.4408411Z compiled: bool,
2025-05-07T20:32:35.4408632Z ) -> None:
2025-05-07T20:32:35.4408850Z torch.manual_seed(2025)
2025-05-07T20:32:35.4409094Z 
2025-05-07T20:32:35.4409359Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:35.4409704Z 
2025-05-07T20:32:35.4409895Z x_sign = torch.sign(x)
2025-05-07T20:32:35.4410177Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:35.4410487Z x = x_sign * x_clamp
2025-05-07T20:32:35.4410732Z x0 = x[:, :D]
2025-05-07T20:32:35.4410944Z x1 = x[:, D:]
2025-05-07T20:32:35.4411150Z 
2025-05-07T20:32:35.4411514Z if contiguous:
2025-05-07T20:32:35.4411922Z x0 = x0.contiguous()
2025-05-07T20:32:35.4412189Z x1 = x1.contiguous()
2025-05-07T20:32:35.4412420Z 
2025-05-07T20:32:35.4412599Z if scale_ub is not None:
2025-05-07T20:32:35.4412869Z scale_ub_tensor = torch.tensor(
2025-05-07T20:32:35.4413207Z [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:35.4413527Z )
2025-05-07T20:32:35.4413721Z else:
2025-05-07T20:32:35.4413938Z scale_ub_tensor = None
2025-05-07T20:32:35.4414188Z 
2025-05-07T20:32:35.4414421Z def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:35.4414737Z op = silu_mul_quant
2025-05-07T20:32:35.4414978Z if compiled:
2025-05-07T20:32:35.4415225Z op = torch.compile(op)
2025-05-07T20:32:35.4415525Z return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:35.4415800Z 
2025-05-07T20:32:35.4415990Z > y_fp8, y_scale = fn()
2025-05-07T20:32:35.4416158Z 
2025-05-07T20:32:35.4416265Z moe/activation_test.py:117:
2025-05-07T20:32:35.4416560Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:35.4416889Z moe/activation_test.py:115: in fn
2025-05-07T20:32:35.4417172Z return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:35.4417866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:35.4418544Z _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:35.4419085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:35.4419757Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.4420413Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.4421020Z kernel = self.compile( 2025-05-07T20:32:35.4421577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.4422238Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.4422629Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.4422852Z 2025-05-07T20:32:35.4423056Z self = 2025-05-07T20:32:35.4424132Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.4425525Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13e0526f20>} 2025-05-07T20:32:35.4426898Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.4427957Z context = 2025-05-07T20:32:35.4428244Z 2025-05-07T20:32:35.4428409Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.4428934Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.4429396Z module_map=module_map) 2025-05-07T20:32:35.4429752Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.4430103Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.4430359Z E ^ 2025-05-07T20:32:35.4430809Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.4431260Z 2025-05-07T20:32:35.4431763Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.4432272Z 2025-05-07T20:32:35.4432374Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.4432780Z self=, 2025-05-07T20:32:35.4433184Z T=128, 2025-05-07T20:32:35.4433372Z D=7168, 2025-05-07T20:32:35.4433562Z scale_ub=None, 2025-05-07T20:32:35.4433769Z contiguous=False, 2025-05-07T20:32:35.4433996Z compiled=True, 2025-05-07T20:32:35.4434203Z ) 2025-05-07T20:32:35.4434514Z self = 2025-05-07T20:32:35.4434995Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:35.4435265Z 2025-05-07T20:32:35.4435346Z @given( 2025-05-07T20:32:35.4435580Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.4435889Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.4436202Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.4436537Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.4436854Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.4437137Z ) 2025-05-07T20:32:35.4437482Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.4437927Z def test_silu_mul_quant( 2025-05-07T20:32:35.4438170Z self, 2025-05-07T20:32:35.4438366Z T: int, 2025-05-07T20:32:35.4438556Z D: int, 2025-05-07T20:32:35.4438775Z scale_ub: Optional[float], 2025-05-07T20:32:35.4439050Z contiguous: bool, 2025-05-07T20:32:35.4439295Z compiled: bool, 2025-05-07T20:32:35.4439510Z ) -> None: 2025-05-07T20:32:35.4439729Z torch.manual_seed(2025) 2025-05-07T20:32:35.4439972Z 2025-05-07T20:32:35.4440515Z x = 
torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.4440860Z 2025-05-07T20:32:35.4441048Z x_sign = torch.sign(x) 2025-05-07T20:32:35.4441338Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.4441644Z x = x_sign * x_clamp 2025-05-07T20:32:35.4441879Z x0 = x[:, :D] 2025-05-07T20:32:35.4442086Z x1 = x[:, D:] 2025-05-07T20:32:35.4442291Z 2025-05-07T20:32:35.4442472Z if contiguous: 2025-05-07T20:32:35.4442694Z x0 = x0.contiguous() 2025-05-07T20:32:35.4442948Z x1 = x1.contiguous() 2025-05-07T20:32:35.4443189Z 2025-05-07T20:32:35.4443373Z if scale_ub is not None: 2025-05-07T20:32:35.4443646Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.4443976Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.4444278Z ) 2025-05-07T20:32:35.4444466Z else: 2025-05-07T20:32:35.4444672Z scale_ub_tensor = None 2025-05-07T20:32:35.4444923Z 2025-05-07T20:32:35.4445149Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.4445461Z op = silu_mul_quant 2025-05-07T20:32:35.4445718Z if compiled: 2025-05-07T20:32:35.4445954Z op = torch.compile(op) 2025-05-07T20:32:35.4446244Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.4446514Z 2025-05-07T20:32:35.4446696Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.4446976Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.4447261Z 2025-05-07T20:32:35.4447487Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.4447819Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.4448109Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.4448416Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.4448765Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.4449084Z 2025-05-07T20:32:35.4449282Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:35.4449471Z 2025-05-07T20:32:35.4449568Z moe/activation_test.py:126: 2025-05-07T20:32:35.4449990Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.4450325Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.4450643Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.4451428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.4452167Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.4452713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.4453382Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.4454074Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.4454793Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.4455532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.4456161Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.4456768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.4457282Z fn() 2025-05-07T20:32:35.4457941Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.4458530Z self.fn.run( 2025-05-07T20:32:35.4459007Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.4459524Z kernel = self.compile( 2025-05-07T20:32:35.4460050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.4460827Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.4461219Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.4461440Z 2025-05-07T20:32:35.4461641Z self = 2025-05-07T20:32:35.4462701Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.4464051Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13e0527e20>} 2025-05-07T20:32:35.4465368Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.4466383Z context = 2025-05-07T20:32:35.4466664Z 2025-05-07T20:32:35.4466831Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.4467345Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.4467876Z module_map=module_map) 2025-05-07T20:32:35.4468241Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.4468586Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.4468853Z E ^ 2025-05-07T20:32:35.4469308Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.4469898Z 2025-05-07T20:32:35.4470316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.6872207Z 2025-05-07T20:32:35.6872401Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6874378Z self=, 2025-05-07T20:32:35.6875522Z T=128, 2025-05-07T20:32:35.6875789Z D=7168, 2025-05-07T20:32:35.6876053Z scale_ub=None, 2025-05-07T20:32:35.6876364Z contiguous=False, 2025-05-07T20:32:35.6876679Z compiled=False, 2025-05-07T20:32:35.6876936Z ) 2025-05-07T20:32:35.6877258Z self = 2025-05-07T20:32:35.6877754Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:35.6878036Z 2025-05-07T20:32:35.6878126Z @given( 2025-05-07T20:32:35.6878351Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6878668Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6878972Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6879295Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6879619Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6879907Z ) 2025-05-07T20:32:35.6880259Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6880713Z def test_silu_mul_quant( 2025-05-07T20:32:35.6880959Z self, 2025-05-07T20:32:35.6881155Z T: int, 2025-05-07T20:32:35.6881345Z D: int, 2025-05-07T20:32:35.6881566Z scale_ub: Optional[float], 2025-05-07T20:32:35.6881848Z contiguous: bool, 2025-05-07T20:32:35.6882078Z compiled: bool, 2025-05-07T20:32:35.6882297Z ) -> None: 2025-05-07T20:32:35.6882510Z torch.manual_seed(2025) 2025-05-07T20:32:35.6882744Z 2025-05-07T20:32:35.6883018Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6883357Z 2025-05-07T20:32:35.6883543Z x_sign = torch.sign(x) 
2025-05-07T20:32:35.6883979Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.6884294Z x = x_sign * x_clamp 2025-05-07T20:32:35.6884534Z x0 = x[:, :D] 2025-05-07T20:32:35.6884744Z x1 = x[:, D:] 2025-05-07T20:32:35.6884949Z 2025-05-07T20:32:35.6885124Z if contiguous: 2025-05-07T20:32:35.6885353Z x0 = x0.contiguous() 2025-05-07T20:32:35.6885606Z x1 = x1.contiguous() 2025-05-07T20:32:35.6885839Z 2025-05-07T20:32:35.6886020Z if scale_ub is not None: 2025-05-07T20:32:35.6886287Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.6886618Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.6886926Z ) 2025-05-07T20:32:35.6887122Z else: 2025-05-07T20:32:35.6887328Z scale_ub_tensor = None 2025-05-07T20:32:35.6887568Z 2025-05-07T20:32:35.6887795Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.6888108Z op = silu_mul_quant 2025-05-07T20:32:35.6888352Z if compiled: 2025-05-07T20:32:35.6888595Z op = torch.compile(op) 2025-05-07T20:32:35.6888893Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.6889165Z 2025-05-07T20:32:35.6889354Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.6889514Z 2025-05-07T20:32:35.6889621Z moe/activation_test.py:117: 2025-05-07T20:32:35.6889912Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.6890237Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.6890514Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.6891232Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.6891905Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.6892437Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.6901923Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.6902721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.6903289Z kernel = self.compile( 2025-05-07T20:32:35.6903832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.6904491Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.6904896Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.6905203Z 2025-05-07T20:32:35.6905511Z self = 2025-05-07T20:32:35.6906625Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.6908161Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f13c73999e0>} 2025-05-07T20:32:35.6909502Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.6910520Z context = 2025-05-07T20:32:35.6910804Z 2025-05-07T20:32:35.6910970Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.6911500Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.6911969Z module_map=module_map) 2025-05-07T20:32:35.6912326Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.6912787Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.6913046Z E ^ 2025-05-07T20:32:35.6913520Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.6913964Z 2025-05-07T20:32:35.6914394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.6914903Z 2025-05-07T20:32:35.6915003Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6915412Z self=, 2025-05-07T20:32:35.6915821Z T=4096, 2025-05-07T20:32:35.6916081Z D=5120, 2025-05-07T20:32:35.6916349Z scale_ub=1200.0, 2025-05-07T20:32:35.6916664Z contiguous=True, 2025-05-07T20:32:35.6916919Z compiled=False, 2025-05-07T20:32:35.6917118Z ) 2025-05-07T20:32:35.6917430Z self = 2025-05-07T20:32:35.6917918Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:35.6918193Z 2025-05-07T20:32:35.6918267Z @given( 2025-05-07T20:32:35.6918492Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6918789Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6919084Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6919406Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6919722Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6919999Z ) 2025-05-07T20:32:35.6920344Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6920773Z def test_silu_mul_quant( 2025-05-07T20:32:35.6921004Z self, 2025-05-07T20:32:35.6921195Z T: int, 2025-05-07T20:32:35.6921393Z D: int, 2025-05-07T20:32:35.6921607Z scale_ub: Optional[float], 2025-05-07T20:32:35.6921872Z contiguous: bool, 2025-05-07T20:32:35.6922115Z compiled: bool, 2025-05-07T20:32:35.6922329Z ) -> None: 2025-05-07T20:32:35.6922539Z torch.manual_seed(2025) 2025-05-07T20:32:35.6922777Z 2025-05-07T20:32:35.6923140Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6923485Z 2025-05-07T20:32:35.6923669Z x_sign = torch.sign(x) 2025-05-07T20:32:35.6923945Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.6924248Z x = x_sign * x_clamp 2025-05-07T20:32:35.6924482Z x0 = x[:, :D] 2025-05-07T20:32:35.6924688Z x1 = x[:, D:] 2025-05-07T20:32:35.6924890Z 2025-05-07T20:32:35.6925070Z if contiguous: 2025-05-07T20:32:35.6925300Z x0 = x0.contiguous() 2025-05-07T20:32:35.6925550Z x1 = x1.contiguous() 2025-05-07T20:32:35.6925788Z 2025-05-07T20:32:35.6925974Z if scale_ub is not None: 2025-05-07T20:32:35.6926237Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.6926568Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.6926877Z ) 2025-05-07T20:32:35.6927062Z else: 2025-05-07T20:32:35.6927274Z scale_ub_tensor = None 2025-05-07T20:32:35.6927518Z 2025-05-07T20:32:35.6927739Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.6928044Z op = silu_mul_quant 2025-05-07T20:32:35.6928293Z if compiled: 
2025-05-07T20:32:35.6928527Z op = torch.compile(op) 2025-05-07T20:32:35.6928817Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.6929085Z 2025-05-07T20:32:35.6929264Z > y_fp8, y_scale = fn() 2025-05-07T20:32:35.6929435Z 2025-05-07T20:32:35.6929530Z moe/activation_test.py:117: 2025-05-07T20:32:35.6929819Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.6930142Z moe/activation_test.py:115: in fn 2025-05-07T20:32:35.6930407Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.6931176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:35.6931863Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:35.6932387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.6933050Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.6933704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.6934222Z kernel = self.compile( 2025-05-07T20:32:35.6934762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.6935405Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.6935841Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.6936073Z 2025-05-07T20:32:35.6936277Z self = 2025-05-07T20:32:35.6937338Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.6938696Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c739a200>} 2025-05-07T20:32:35.6940008Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.6941335Z context = 2025-05-07T20:32:35.6941616Z 2025-05-07T20:32:35.6941781Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.6942302Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.6942911Z module_map=module_map) 2025-05-07T20:32:35.6943276Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.6943616Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.6943872Z E ^ 2025-05-07T20:32:35.6944325Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.6944766Z 2025-05-07T20:32:35.6945182Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.6945691Z 2025-05-07T20:32:35.6945789Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:35.6946192Z self=, 2025-05-07T20:32:35.6946597Z T=1, 2025-05-07T20:32:35.6946782Z D=5120, 2025-05-07T20:32:35.6946973Z scale_ub=None, 2025-05-07T20:32:35.6947185Z contiguous=True, 2025-05-07T20:32:35.6947395Z compiled=True, 2025-05-07T20:32:35.6947656Z ) 2025-05-07T20:32:35.6947967Z self = 2025-05-07T20:32:35.6948438Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:35.6948691Z 2025-05-07T20:32:35.6948765Z @given( 2025-05-07T20:32:35.6948986Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:35.6949290Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:35.6949580Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:35.6949896Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:35.6950215Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:35.6950483Z ) 2025-05-07T20:32:35.6950830Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:35.6951387Z def test_silu_mul_quant( 2025-05-07T20:32:35.6951614Z self, 2025-05-07T20:32:35.6951802Z T: int, 2025-05-07T20:32:35.6951996Z D: int, 2025-05-07T20:32:35.6952206Z scale_ub: Optional[float], 2025-05-07T20:32:35.6952468Z contiguous: bool, 2025-05-07T20:32:35.6952700Z compiled: bool, 2025-05-07T20:32:35.6952906Z ) -> None: 2025-05-07T20:32:35.6953111Z torch.manual_seed(2025) 2025-05-07T20:32:35.6953346Z 2025-05-07T20:32:35.6953610Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:35.6953939Z 2025-05-07T20:32:35.6954148Z x_sign = torch.sign(x) 2025-05-07T20:32:35.6954429Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:35.6954729Z x = x_sign * x_clamp 2025-05-07T20:32:35.6954956Z x0 = x[:, :D] 2025-05-07T20:32:35.6955158Z x1 = x[:, D:] 2025-05-07T20:32:35.6955352Z 2025-05-07T20:32:35.6955524Z if contiguous: 2025-05-07T20:32:35.6955755Z x0 = x0.contiguous() 2025-05-07T20:32:35.6956005Z x1 = x1.contiguous() 2025-05-07T20:32:35.6956229Z 2025-05-07T20:32:35.6956415Z if scale_ub is not None: 2025-05-07T20:32:35.6956681Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:35.6957001Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:35.6957301Z ) 2025-05-07T20:32:35.6957488Z else: 2025-05-07T20:32:35.6957687Z scale_ub_tensor = None 2025-05-07T20:32:35.6957932Z 2025-05-07T20:32:35.6958160Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.6958463Z op = silu_mul_quant 2025-05-07T20:32:35.6958703Z if compiled: 2025-05-07T20:32:35.6958943Z op = torch.compile(op) 2025-05-07T20:32:35.6959236Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:35.6959495Z 2025-05-07T20:32:35.6959691Z y_fp8, y_scale = fn() 2025-05-07T20:32:35.6959979Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:35.6960260Z 2025-05-07T20:32:35.6960489Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:35.6960905Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:35.6961185Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:35.6961493Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:35.6961844Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.6962143Z 2025-05-07T20:32:35.6962338Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:35.6962535Z 2025-05-07T20:32:35.6962630Z moe/activation_test.py:126: 2025-05-07T20:32:35.6962917Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.6963235Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:35.6963548Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:35.6964315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:35.6965056Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:35.6965633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:35.6966303Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:35.6966979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:35.6967680Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:35.6968414Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:35.6969036Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:35.6969627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:35.6970238Z fn() 2025-05-07T20:32:35.6970742Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:35.6971312Z self.fn.run( 2025-05-07T20:32:35.6971771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:35.6972285Z kernel = self.compile( 2025-05-07T20:32:35.6972816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:35.6973448Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.6973843Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:35.6974070Z 2025-05-07T20:32:35.6974276Z self = 2025-05-07T20:32:35.6975344Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:35.6976752Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c739ac00>} 2025-05-07T20:32:35.6978101Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:35.6979128Z context = 2025-05-07T20:32:35.6979417Z 2025-05-07T20:32:35.6979580Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:35.6980102Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.6980562Z module_map=module_map) 2025-05-07T20:32:35.6980925Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.6981359Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:35.6981617Z E ^ 2025-05-07T20:32:35.6982072Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:35.6982522Z 2025-05-07T20:32:35.6982935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:35.9220021Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:35.9222232Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Traceback (most recent call last): 2025-05-07T20:32:35.9224724Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:35.9226729Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:35.9227783Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:35.9229068Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:35.9230442Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:35.9231920Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:35.9233277Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:35.9234308Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] module_map=module_map) 2025-05-07T20:32:35.9235599Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:35.9236843Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] generator.visit(fn.parse()) 2025-05-07T20:32:35.9237669Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:35.9238853Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:35.9240040Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ret = super().visit(node) 2025-05-07T20:32:35.9241322Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:35.9242324Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return visitor(node) 2025-05-07T20:32:35.9243657Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:35.9244918Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:35.9245853Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:35.9246916Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:35.9247948Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] self.visit(item) 2025-05-07T20:32:35.9248703Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:35.9249850Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:35.9251178Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:35.9252223Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:35.9253106Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:35.9253948Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^ 2025-05-07T20:32:35.9254954Z W0507 20:32:35.918000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
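The W0507 lines themselves are warnings, not the test failure: when torch.compile encounters a user-defined Triton kernel, identify_mutated_tensors in torch/_higher_order_ops/triton_kernel_wrap.py compiles the kernel to TTIR (the generate_ttir frame in the traceback above) to work out which tensor arguments the kernel writes to. When that compilation raises, as it does here, it falls back to treating every input as mutated, which is functionally safe but pessimistic. A rough sketch of that fallback shape, with generate_ttir_stub standing in for the real internals:

    # Rough sketch of the logged fallback; generate_ttir_stub is a stand-in
    # that fails the same way the real generate_ttir does on this runner.
    import torch

    def generate_ttir_stub(kernel, kwargs):
        raise ValueError("type fp8e4nv not supported in this architecture.")

    def identify_mutated_tensors_sketch(kernel, kwargs):
        try:
            ttir_module, tensor_names = generate_ttir_stub(kernel, kwargs)
            # The real code walks the TTIR here, looking for stores into
            # each tensor argument.
            return []
        except Exception:
            # "assuming every input is mutated": safe, but it blocks
            # optimizations that rely on knowing an input is read-only.
            return [k for k, v in kwargs.items() if isinstance(v, torch.Tensor)]

    # With the compile failing, every tensor argument counts as mutated:
    print(identify_mutated_tensors_sketch(None, {"x0": torch.empty(2), "n": 4}))
    # -> ['x0']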
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.1794313Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:36.1795756Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] Traceback (most recent call last): 2025-05-07T20:32:36.1797091Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:36.1798509Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:36.1799582Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:36.1800894Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:36.1802272Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.1803831Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:36.1805304Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.1806351Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] module_map=module_map) 2025-05-07T20:32:36.1807763Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:36.1809005Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] generator.visit(fn.parse()) 2025-05-07T20:32:36.1809883Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:36.1811334Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:36.1812547Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ret = super().visit(node) 2025-05-07T20:32:36.1813590Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:36.1814757Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] return visitor(node) 
2025-05-07T20:32:36.1816020Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:36.1817292Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:36.1818188Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:36.1819375Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:36.1820407Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] self.visit(item) 2025-05-07T20:32:36.1821169Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:36.1822324Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:36.1823689Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:36.1824747Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.1825679Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.1826414Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ^ 2025-05-07T20:32:36.1827421Z W0507 20:32:36.176000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/4] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.6031103Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:36.6032775Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:32:36.6034535Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:36.6037830Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:36.6038776Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:36.6040054Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:36.6041772Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.6043387Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:36.6045100Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.6046382Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] module_map=module_map) 2025-05-07T20:32:36.6047954Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:36.6049344Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:32:36.6050169Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:36.6051346Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:36.6052522Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:32:36.6053532Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:36.6054538Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 
2025-05-07T20:32:36.6055733Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:36.6056984Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:36.6057857Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:36.6058917Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:36.6060050Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:32:36.6060801Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:36.6061933Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:36.6063253Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:36.6064281Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.6065170Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.6066051Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:32:36.6067035Z W0507 20:32:36.599000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.6643597Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:36.6644845Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:32:36.6646165Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:36.6647812Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:36.6648777Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:36.6650067Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:36.6651426Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.6652729Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:36.6654069Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.6655098Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] module_map=module_map) 2025-05-07T20:32:36.6663936Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:36.6665173Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:32:36.6666214Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:36.6667424Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:36.6668688Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:32:36.6669704Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:36.6670694Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 
2025-05-07T20:32:36.6671898Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:36.6673167Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:36.6674058Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:36.6675126Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:36.6676150Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:32:36.6677162Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:36.6678617Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:36.6680302Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:36.6681604Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.6682710Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.6683593Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:32:36.6684854Z W0507 20:32:36.660000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.8518893Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:36.8520117Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:32:36.8521429Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:36.8522818Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:36.8523959Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:36.8525242Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:36.8526650Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.8527927Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:36.8529281Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.8530307Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] module_map=module_map) 2025-05-07T20:32:36.8531555Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:36.8532779Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:32:36.8533602Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:36.8534904Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:36.8536091Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:32:36.8537112Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:36.8538112Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 
2025-05-07T20:32:36.8539307Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:36.8540747Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:36.8541632Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:36.8542704Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:36.8543729Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:32:36.8544486Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:36.8545635Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:36.8547103Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:36.8548210Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.8549118Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.8549842Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:32:36.8550836Z W0507 20:32:36.848000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:36.8610984Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:36.8612354Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] Traceback (most recent call last): 2025-05-07T20:32:36.8614015Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:36.8615802Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:36.8616804Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:36.8618243Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:36.8619592Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:36.8620862Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:36.8622206Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:36.8623232Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] module_map=module_map) 2025-05-07T20:32:36.8624463Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:36.8625686Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] generator.visit(fn.parse()) 2025-05-07T20:32:36.8626560Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:36.8627813Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:36.8628996Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ret = super().visit(node) 2025-05-07T20:32:36.8630083Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:36.8631075Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] return visitor(node) 
2025-05-07T20:32:36.8632265Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:36.8633506Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:36.8634392Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:36.8635492Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:36.8636560Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] self.visit(item) 2025-05-07T20:32:36.8637305Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:36.8638439Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:36.8639762Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:36.8641076Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:36.8641964Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:36.8642677Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ^ 2025-05-07T20:32:36.8643665Z W0507 20:32:36.857000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/5] ValueError("type fp8e4nv not supported in this architecture. 
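The repeated ValueError pins down the root cause: Triton's fp8e4nv type (torch.float8_e4m3fn) needs hardware FP8 support, which NVIDIA introduced with SM 8.9 (Ada) and SM 9.0 (Hopper), while the A10G in a g5.4xlarge reports SM 8.6 and therefore only offers fp8e4b15 and fp8e5. A capability guard would look roughly like the sketch below (illustrative only, not code from this job; pick_fp8_dtype is a hypothetical helper):

    import torch

    def pick_fp8_dtype() -> torch.dtype:
        # fp8e4nv (torch.float8_e4m3fn) requires SM >= 8.9 (Ada/Hopper);
        # on older parts such as the A10G (SM 8.6), Triton only accepts
        # fp8e5 (torch.float8_e5m2) and fp8e4b15.
        if torch.cuda.get_device_capability() >= (8, 9):
            return torch.float8_e4m3fn
        return torch.float8_e5m2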
2025-05-07T20:32:37.2475759Z 
2025-05-07T20:32:37.2476209Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:37.2476828Z     self=,
2025-05-07T20:32:37.2477444Z     T=2048,
2025-05-07T20:32:37.2477709Z     D=5120,
2025-05-07T20:32:37.2477967Z     scale_ub=None,
2025-05-07T20:32:37.2478244Z     contiguous=True,
2025-05-07T20:32:37.2478474Z     compiled=True,
2025-05-07T20:32:37.2478676Z )
2025-05-07T20:32:37.2479000Z self = 
2025-05-07T20:32:37.2479486Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:37.2479758Z 
2025-05-07T20:32:37.2479846Z     @given(
2025-05-07T20:32:37.2480070Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:37.2480380Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:37.2480693Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:37.2481020Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:37.2481352Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:37.2481637Z     )
2025-05-07T20:32:37.2481988Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:37.2482451Z     def test_silu_mul_quant(
2025-05-07T20:32:37.2482687Z         self,
2025-05-07T20:32:37.2483056Z         T: int,
2025-05-07T20:32:37.2483256Z         D: int,
2025-05-07T20:32:37.2483474Z         scale_ub: Optional[float],
2025-05-07T20:32:37.2483738Z         contiguous: bool,
2025-05-07T20:32:37.2483976Z         compiled: bool,
2025-05-07T20:32:37.2484205Z     ) -> None:
2025-05-07T20:32:37.2484412Z         torch.manual_seed(2025)
2025-05-07T20:32:37.2484652Z 
2025-05-07T20:32:37.2484920Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:37.2485260Z 
2025-05-07T20:32:37.2485447Z         x_sign = torch.sign(x)
2025-05-07T20:32:37.2485739Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:37.2486048Z         x = x_sign * x_clamp
2025-05-07T20:32:37.2486277Z         x0 = x[:, :D]
2025-05-07T20:32:37.2486494Z         x1 = x[:, D:]
2025-05-07T20:32:37.2486707Z 
2025-05-07T20:32:37.2486886Z         if contiguous:
2025-05-07T20:32:37.2487118Z             x0 = x0.contiguous()
2025-05-07T20:32:37.2487382Z             x1 = x1.contiguous()
2025-05-07T20:32:37.2487613Z 
2025-05-07T20:32:37.2487806Z         if scale_ub is not None:
2025-05-07T20:32:37.2488082Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:37.2488409Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:37.2488723Z             )
2025-05-07T20:32:37.2488920Z         else:
2025-05-07T20:32:37.2489123Z             scale_ub_tensor = None
2025-05-07T20:32:37.2489370Z 
2025-05-07T20:32:37.2489598Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:37.2489911Z             op = silu_mul_quant
2025-05-07T20:32:37.2490148Z             if compiled:
2025-05-07T20:32:37.2490400Z                 op = torch.compile(op)
2025-05-07T20:32:37.2490701Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:37.2490963Z 
2025-05-07T20:32:37.2491276Z         y_fp8, y_scale = fn()
2025-05-07T20:32:37.2491560Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:37.2491843Z 
2025-05-07T20:32:37.2492079Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:37.2492406Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:37.2492686Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:37.2492992Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:37.2493351Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:37.2493665Z 
2025-05-07T20:32:37.2493858Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:37.2494058Z 
2025-05-07T20:32:37.2494157Z moe/activation_test.py:126: 
2025-05-07T20:32:37.2494450Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:37.2494772Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:37.2495092Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:37.2495904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:37.2496637Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:37.2497178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:37.2497878Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:37.2498566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:37.2499269Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:37.2499989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:37.2500619Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:37.2501217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:37.2501718Z     fn()
2025-05-07T20:32:37.2502309Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:37.2502884Z     self.fn.run(
2025-05-07T20:32:37.2503343Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:37.2503862Z     kernel = self.compile(
2025-05-07T20:32:37.2504397Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:37.2505038Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:37.2505422Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:37.2505650Z 
2025-05-07T20:32:37.2505852Z self = 
2025-05-07T20:32:37.2506923Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:37.2508358Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c77ca020>}
2025-05-07T20:32:37.2509667Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:37.2510682Z context = 
2025-05-07T20:32:37.2510964Z 
2025-05-07T20:32:37.2511133Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:37.2511638Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:37.2512183Z                            module_map=module_map)
2025-05-07T20:32:37.2512552Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:37.2512901Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:37.2513164Z E       ^
2025-05-07T20:32:37.2513618Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:37.2514054Z 
2025-05-07T20:32:37.2514473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
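Since the failure is an architecture limitation rather than a logic bug in the kernel, the test could be gated on device capability before hypothesis starts drawing examples. A hedged sketch using the standard unittest decorator (the class name here is hypothetical; moe/activation_test.py may organize its cases differently):

    import unittest

    import torch

    class ActivationTests(unittest.TestCase):  # hypothetical name for the sketch
        @unittest.skipIf(
            not torch.cuda.is_available()
            or torch.cuda.get_device_capability() < (8, 9),
            "fp8e4nv (torch.float8_e4m3fn) requires SM 8.9+ (Ada/Hopper)",
        )
        def test_silu_mul_quant(self) -> None:
            ...  # body as shown in the log above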
2025-05-07T20:32:37.2514976Z 
2025-05-07T20:32:37.2515075Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:37.2515488Z     self=,
2025-05-07T20:32:37.2515885Z     T=128,
2025-05-07T20:32:37.2516065Z     D=5120,
2025-05-07T20:32:37.2516254Z     scale_ub=None,
2025-05-07T20:32:37.2516470Z     contiguous=True,
2025-05-07T20:32:37.2516686Z     compiled=True,
2025-05-07T20:32:37.2516889Z )
[... this example fails with the same test source, traceback, and fp8e4nv CompilationError footer as the T=2048 example above; the verbatim duplicate is elided ...]
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.2552584Z 2025-05-07T20:32:37.2553001Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:37.4833248Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:37.4834537Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Traceback (most recent call last): 2025-05-07T20:32:37.4835870Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:37.4837312Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:37.4838259Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.4839542Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:37.4841184Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.4842977Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:37.4844332Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.4845351Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] module_map=module_map) 2025-05-07T20:32:37.4846598Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:37.4847818Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] generator.visit(fn.parse()) 2025-05-07T20:32:37.4848652Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:37.4849825Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:37.4851000Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ret = super().visit(node) 2025-05-07T20:32:37.4852018Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:37.4853015Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return visitor(node) 2025-05-07T20:32:37.4854341Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:37.4855596Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:37.4856483Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:37.4857571Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:37.4858585Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] self.visit(item) 2025-05-07T20:32:37.4859350Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:37.4860498Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:37.4861823Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:37.4862859Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.4863775Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.4864497Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^ 2025-05-07T20:32:37.4865568Z W0507 20:32:37.479000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:37.5446809Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:37.5448040Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] Traceback (most recent call last): 2025-05-07T20:32:37.5449347Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:37.5450733Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:37.5451721Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:37.5452997Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:37.5454349Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:37.5455620Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:37.5457130Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:37.5458161Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] module_map=module_map) 2025-05-07T20:32:37.5459399Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:37.5460621Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] generator.visit(fn.parse()) 2025-05-07T20:32:37.5461439Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:37.5462624Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:37.5463796Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ret = super().visit(node) 2025-05-07T20:32:37.5464801Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:37.5465797Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] return visitor(node) 
2025-05-07T20:32:37.5467035Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:37.5468477Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:37.5469353Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:37.5470416Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:37.5471437Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] self.visit(item) 2025-05-07T20:32:37.5472212Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:37.5473348Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:37.5474688Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:37.5475739Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:37.5476682Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:37.5477415Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ^ 2025-05-07T20:32:37.5478401Z W0507 20:32:37.541000 88176 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [1/6] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
[... identical "Encountered an exception in identify_mutated_tensors, assuming every input is mutated" warnings and fp8e4nv CompilationError tracebacks repeated for torch.compile frames [1/6] through [1/7]; duplicates omitted ...]
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:38.6877233Z 2025-05-07T20:32:38.6877713Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:38.6878163Z self=, 2025-05-07T20:32:38.6878584Z T=4096, 2025-05-07T20:32:38.6878842Z D=5120, 2025-05-07T20:32:38.6879040Z scale_ub=None, 2025-05-07T20:32:38.6879259Z contiguous=True, 2025-05-07T20:32:38.6879492Z compiled=True, 2025-05-07T20:32:38.6879731Z ) 2025-05-07T20:32:38.6880086Z self = 2025-05-07T20:32:38.6880579Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:38.6880862Z 2025-05-07T20:32:38.6880943Z @given( 2025-05-07T20:32:38.6881181Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:38.6881497Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:38.6881808Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:38.6882151Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:38.6882484Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:38.6882766Z ) 2025-05-07T20:32:38.6883132Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:38.6883924Z def test_silu_mul_quant( 2025-05-07T20:32:38.6884167Z self, 2025-05-07T20:32:38.6884369Z T: int, 2025-05-07T20:32:38.6884579Z D: int, 2025-05-07T20:32:38.6884802Z scale_ub: Optional[float], 2025-05-07T20:32:38.6885085Z contiguous: bool, 2025-05-07T20:32:38.6885334Z compiled: bool, 2025-05-07T20:32:38.6885559Z ) -> None: 2025-05-07T20:32:38.6885790Z torch.manual_seed(2025) 2025-05-07T20:32:38.6886046Z 2025-05-07T20:32:38.6886350Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:38.6886712Z 2025-05-07T20:32:38.6886914Z x_sign = torch.sign(x) 2025-05-07T20:32:38.6887210Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:38.6887516Z x = x_sign * x_clamp 2025-05-07T20:32:38.6887760Z x0 = x[:, :D] 2025-05-07T20:32:38.6887981Z x1 = x[:, D:] 2025-05-07T20:32:38.6888184Z 2025-05-07T20:32:38.6888376Z if contiguous: 2025-05-07T20:32:38.6888621Z x0 = x0.contiguous() 2025-05-07T20:32:38.6888878Z x1 = x1.contiguous() 2025-05-07T20:32:38.6889128Z 2025-05-07T20:32:38.6889340Z if scale_ub is not None: 2025-05-07T20:32:38.6889612Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:38.6889953Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:38.6890270Z ) 2025-05-07T20:32:38.6890457Z else: 2025-05-07T20:32:38.6890672Z scale_ub_tensor = None 2025-05-07T20:32:38.6890932Z 2025-05-07T20:32:38.6891162Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:38.6891481Z op = silu_mul_quant 2025-05-07T20:32:38.6891738Z if compiled: 2025-05-07T20:32:38.6891997Z op = torch.compile(op) 2025-05-07T20:32:38.6892288Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.6892564Z 2025-05-07T20:32:38.6892766Z y_fp8, y_scale = fn() 2025-05-07T20:32:38.6893056Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:38.6893353Z 2025-05-07T20:32:38.6893596Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:38.6894080Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:38.6894383Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:38.6894701Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:38.6895056Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:38.6895370Z 2025-05-07T20:32:38.6895580Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:38.6895780Z 2025-05-07T20:32:38.6895898Z moe/activation_test.py:126: 2025-05-07T20:32:38.6896194Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.6896535Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:38.6896869Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:38.6897645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:38.6898400Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:38.6898955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:38.6899636Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:38.6900320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:38.6901068Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:38.6901801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:38.6902432Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:38.6903028Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:38.6903710Z fn() 2025-05-07T20:32:38.6904234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:38.6904826Z self.fn.run( 2025-05-07T20:32:38.6905299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:38.6905827Z kernel = self.compile( 2025-05-07T20:32:38.6906378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:38.6907029Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:38.6907430Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.6907739Z 2025-05-07T20:32:38.6907957Z self = 2025-05-07T20:32:38.6909033Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:38.6910428Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c67eaac0>} 2025-05-07T20:32:38.6911792Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:38.6912811Z context = 2025-05-07T20:32:38.6913092Z 2025-05-07T20:32:38.6913268Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:38.6913794Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:38.6914271Z module_map=module_map) 2025-05-07T20:32:38.6914642Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:38.6915087Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:38.6915357Z E ^ 2025-05-07T20:32:38.6915816Z E ValueError("type fp8e4nv not supported in this architecture. 
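Both the compiled path (_fbgemm_silu_mul_quant) and the eager reference path (_kernel_quantize_fp8_row, reached through triton_quantize_fp8_row) fail the same way, so the whole test requires fp8e4nv support. One way to keep the suite green on such runners is to gate the test on the capability probe above — a sketch only, reusing the illustrative fp8_e4m3_supported helper and an illustrative class name:

import unittest

import torch

def fp8_e4m3_supported() -> bool:
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

class ActivationTests(unittest.TestCase):  # illustrative class name
    @unittest.skipIf(
        not fp8_e4m3_supported(),
        "fp8e4nv (float8_e4m3fn) needs compute capability >= 8.9",
    )
    def test_silu_mul_quant(self) -> None:
        ...
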
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:38.6916259Z 2025-05-07T20:32:38.6916692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:38.6917243Z 2025-05-07T20:32:38.6917348Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:38.6917761Z self=, 2025-05-07T20:32:38.6918172Z T=16384, 2025-05-07T20:32:38.6918360Z D=5120, 2025-05-07T20:32:38.6918558Z scale_ub=None, 2025-05-07T20:32:38.6918778Z contiguous=True, 2025-05-07T20:32:38.6919007Z compiled=True, 2025-05-07T20:32:38.6919212Z ) 2025-05-07T20:32:38.6919533Z self = 2025-05-07T20:32:38.6920032Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:38.6920314Z 2025-05-07T20:32:38.6920394Z @given( 2025-05-07T20:32:38.6920631Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:38.6920947Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:38.6921246Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:38.6921579Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:38.6921911Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:38.6922190Z ) 2025-05-07T20:32:38.6922541Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:38.6922994Z def test_silu_mul_quant( 2025-05-07T20:32:38.6923250Z self, 2025-05-07T20:32:38.6923444Z T: int, 2025-05-07T20:32:38.6923740Z D: int, 2025-05-07T20:32:38.6923967Z scale_ub: Optional[float], 2025-05-07T20:32:38.6924235Z contiguous: bool, 2025-05-07T20:32:38.6924484Z compiled: bool, 2025-05-07T20:32:38.6924708Z ) -> None: 2025-05-07T20:32:38.6924922Z torch.manual_seed(2025) 2025-05-07T20:32:38.6925170Z 2025-05-07T20:32:38.6925449Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:38.6925780Z 2025-05-07T20:32:38.6925979Z x_sign = torch.sign(x) 2025-05-07T20:32:38.6926272Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:38.6926577Z x = x_sign * x_clamp 2025-05-07T20:32:38.6926818Z x0 = x[:, :D] 2025-05-07T20:32:38.6927039Z x1 = x[:, D:] 2025-05-07T20:32:38.6927242Z 2025-05-07T20:32:38.6927430Z if contiguous: 2025-05-07T20:32:38.6927664Z x0 = x0.contiguous() 2025-05-07T20:32:38.6927926Z x1 = x1.contiguous() 2025-05-07T20:32:38.6928157Z 2025-05-07T20:32:38.6928361Z if scale_ub is not None: 2025-05-07T20:32:38.6928636Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:38.6928969Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:38.6929280Z ) 2025-05-07T20:32:38.6929479Z else: 2025-05-07T20:32:38.6929684Z scale_ub_tensor = None 2025-05-07T20:32:38.6929936Z 2025-05-07T20:32:38.6930169Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:38.6930477Z op = silu_mul_quant 2025-05-07T20:32:38.6930732Z if compiled: 2025-05-07T20:32:38.6930984Z op = torch.compile(op) 2025-05-07T20:32:38.6931273Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.6931552Z 2025-05-07T20:32:38.6931751Z y_fp8, y_scale = fn() 2025-05-07T20:32:38.6932030Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:38.6932324Z 2025-05-07T20:32:38.6932565Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:38.6932907Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:38.6933193Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:38.6933598Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:38.6933966Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:38.6934267Z 2025-05-07T20:32:38.6934472Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:38.6934664Z 2025-05-07T20:32:38.6934773Z moe/activation_test.py:126: 2025-05-07T20:32:38.6935065Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.6935403Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:38.6935730Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:38.6936516Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:38.6937253Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:38.6937805Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:38.6938490Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:38.6939174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:38.6939897Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:38.6940966Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:38.6941603Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:38.6942200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:38.6942715Z fn() 2025-05-07T20:32:38.6943244Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:38.6943999Z self.fn.run( 2025-05-07T20:32:38.6944470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:38.6944997Z kernel = self.compile( 2025-05-07T20:32:38.6945540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:38.6946193Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:38.6946634Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.6946864Z 2025-05-07T20:32:38.6947067Z self = 2025-05-07T20:32:38.6948204Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:38.6949556Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13f8b5a520>} 2025-05-07T20:32:38.6950875Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:38.6951882Z context = 2025-05-07T20:32:38.6952163Z 2025-05-07T20:32:38.6952333Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:38.6952858Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:38.6953328Z module_map=module_map) 2025-05-07T20:32:38.6953691Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:38.6954049Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:38.6954308Z E ^ 2025-05-07T20:32:38.6954917Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:38.6955358Z 2025-05-07T20:32:38.6955782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:38.7174724Z W0507 20:32:38.716000 88176 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] torch._dynamo hit config.recompile_limit (8) 2025-05-07T20:32:38.7176948Z W0507 20:32:38.716000 88176 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55) 2025-05-07T20:32:38.7178314Z W0507 20:32:38.716000 88176 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] last reason: 1/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240 2025-05-07T20:32:38.7179315Z W0507 20:32:38.716000 88176 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". 2025-05-07T20:32:38.7180404Z W0507 20:32:38.716000 88176 site-packages/torch/_dynamo/convert_frame.py:987] [1/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. 2025-05-07T20:32:38.9312460Z 2025-05-07T20:32:38.9312821Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:38.9313265Z self=, 2025-05-07T20:32:38.9313711Z T=1, 2025-05-07T20:32:38.9313892Z D=5120, 2025-05-07T20:32:38.9314085Z scale_ub=1200.0, 2025-05-07T20:32:38.9324073Z contiguous=True, 2025-05-07T20:32:38.9324349Z compiled=True, 2025-05-07T20:32:38.9324555Z ) 2025-05-07T20:32:38.9324869Z self = 2025-05-07T20:32:38.9325748Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:38.9326013Z 2025-05-07T20:32:38.9326092Z @given( 2025-05-07T20:32:38.9326329Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:38.9326643Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:38.9326944Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:38.9327277Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:38.9327612Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:38.9327900Z ) 2025-05-07T20:32:38.9328251Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:38.9328710Z def test_silu_mul_quant( 2025-05-07T20:32:38.9328948Z self, 2025-05-07T20:32:38.9329152Z T: int, 2025-05-07T20:32:38.9329357Z D: int, 2025-05-07T20:32:38.9329571Z scale_ub: Optional[float], 2025-05-07T20:32:38.9329847Z contiguous: bool, 2025-05-07T20:32:38.9330091Z compiled: bool, 2025-05-07T20:32:38.9330322Z ) -> None: 2025-05-07T20:32:38.9330532Z torch.manual_seed(2025) 2025-05-07T20:32:38.9330774Z 2025-05-07T20:32:38.9331056Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:38.9331395Z 2025-05-07T20:32:38.9331589Z x_sign = torch.sign(x) 2025-05-07T20:32:38.9331882Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:38.9332188Z x = x_sign * x_clamp 2025-05-07T20:32:38.9332420Z x0 = x[:, :D] 2025-05-07T20:32:38.9332633Z x1 = x[:, D:] 2025-05-07T20:32:38.9332845Z 2025-05-07T20:32:38.9333027Z if contiguous: 2025-05-07T20:32:38.9333257Z x0 = x0.contiguous() 2025-05-07T20:32:38.9333515Z x1 = x1.contiguous() 2025-05-07T20:32:38.9333752Z 2025-05-07T20:32:38.9333938Z if scale_ub is not None: 2025-05-07T20:32:38.9334213Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:38.9334540Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:32:38.9334856Z ) 2025-05-07T20:32:38.9335053Z else: 2025-05-07T20:32:38.9335411Z scale_ub_tensor = None 2025-05-07T20:32:38.9335664Z 2025-05-07T20:32:38.9335893Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:38.9336200Z op = silu_mul_quant 2025-05-07T20:32:38.9336448Z if compiled: 2025-05-07T20:32:38.9336694Z op = torch.compile(op) 2025-05-07T20:32:38.9336988Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.9337260Z 2025-05-07T20:32:38.9337451Z > y_fp8, y_scale = fn() 2025-05-07T20:32:38.9337615Z 2025-05-07T20:32:38.9337722Z moe/activation_test.py:117: 2025-05-07T20:32:38.9338009Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.9338339Z moe/activation_test.py:115: in fn 2025-05-07T20:32:38.9338615Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.9339176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:38.9339727Z return fn(*args, **kwargs) 2025-05-07T20:32:38.9340798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:38.9341518Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:38.9342059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:38.9342743Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:38.9343404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:38.9343926Z kernel = self.compile( 2025-05-07T20:32:38.9344493Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:38.9345281Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:38.9345682Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.9345907Z 2025-05-07T20:32:38.9346118Z self = 2025-05-07T20:32:38.9347196Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:38.9348652Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c5d0f1a0>} 2025-05-07T20:32:38.9350007Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:38.9351024Z context = 2025-05-07T20:32:38.9351307Z 2025-05-07T20:32:38.9351481Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:38.9352007Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:38.9352484Z module_map=module_map) 2025-05-07T20:32:38.9352843Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:38.9353201Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:38.9353461Z E ^ 2025-05-07T20:32:38.9353919Z E ValueError("type fp8e4nv not supported in this architecture. 
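Separately from the fp8 failures, the recompile_limit warning above shows torch.compile giving up on silu_mul_quant after 8 recompiles: each sampled (T, contiguous) combination changes x0's shape or strides, a guard fails, and Dynamo recompiles until it hits the limit and falls back to eager. For property-based tests that deliberately sweep shapes and strides, two common mitigations are raising the limit or resetting Dynamo between examples — a sketch with illustrative values:

import torch
import torch._dynamo

# Option 1: permit more guard-miss recompiles (default is 8, per the warning).
torch._dynamo.config.recompile_limit = 64

# Option 2: discard compiled state between hypothesis examples so each
# example starts from a clean cache instead of exhausting one frame's limit.
torch._dynamo.reset()
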
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:38.9354369Z 2025-05-07T20:32:38.9354777Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:38.9355289Z 2025-05-07T20:32:38.9355396Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:38.9355800Z self=, 2025-05-07T20:32:38.9356334Z T=1, 2025-05-07T20:32:38.9356516Z D=5120, 2025-05-07T20:32:38.9356714Z scale_ub=None, 2025-05-07T20:32:38.9356927Z contiguous=False, 2025-05-07T20:32:38.9357146Z compiled=True, 2025-05-07T20:32:38.9357355Z ) 2025-05-07T20:32:38.9357678Z self = 2025-05-07T20:32:38.9358154Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:38.9358423Z 2025-05-07T20:32:38.9358502Z @given( 2025-05-07T20:32:38.9358733Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:38.9359036Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:38.9359343Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:38.9359668Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:38.9359989Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:38.9360270Z ) 2025-05-07T20:32:38.9360623Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:38.9361068Z def test_silu_mul_quant( 2025-05-07T20:32:38.9361305Z self, 2025-05-07T20:32:38.9361497Z T: int, 2025-05-07T20:32:38.9361694Z D: int, 2025-05-07T20:32:38.9361902Z scale_ub: Optional[float], 2025-05-07T20:32:38.9362178Z contiguous: bool, 2025-05-07T20:32:38.9362416Z compiled: bool, 2025-05-07T20:32:38.9362634Z ) -> None: 2025-05-07T20:32:38.9362844Z torch.manual_seed(2025) 2025-05-07T20:32:38.9363099Z 2025-05-07T20:32:38.9363364Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:38.9363713Z 2025-05-07T20:32:38.9363905Z x_sign = torch.sign(x) 2025-05-07T20:32:38.9364190Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:38.9364499Z x = x_sign * x_clamp 2025-05-07T20:32:38.9364885Z x0 = x[:, :D] 2025-05-07T20:32:38.9365100Z x1 = x[:, D:] 2025-05-07T20:32:38.9365311Z 2025-05-07T20:32:38.9365500Z if contiguous: 2025-05-07T20:32:38.9365729Z x0 = x0.contiguous() 2025-05-07T20:32:38.9365978Z x1 = x1.contiguous() 2025-05-07T20:32:38.9366245Z 2025-05-07T20:32:38.9366472Z if scale_ub is not None: 2025-05-07T20:32:38.9366743Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:38.9367075Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:38.9367382Z ) 2025-05-07T20:32:38.9367576Z else: 2025-05-07T20:32:38.9367787Z scale_ub_tensor = None 2025-05-07T20:32:38.9368036Z 2025-05-07T20:32:38.9368257Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:38.9368565Z op = silu_mul_quant 2025-05-07T20:32:38.9368822Z if compiled: 2025-05-07T20:32:38.9369062Z op = torch.compile(op) 2025-05-07T20:32:38.9369361Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:38.9369637Z 2025-05-07T20:32:38.9369823Z y_fp8, y_scale = fn() 2025-05-07T20:32:38.9370104Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:38.9370391Z 2025-05-07T20:32:38.9370626Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:38.9370947Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:38.9371231Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:38.9371538Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:38.9371886Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:38.9372198Z 2025-05-07T20:32:38.9372395Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:38.9372585Z 2025-05-07T20:32:38.9372680Z moe/activation_test.py:126: 2025-05-07T20:32:38.9372977Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.9373310Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:38.9373632Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:38.9374487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:38.9375231Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:38.9375768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:38.9376462Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:38.9377172Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:38.9377883Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:38.9378625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:38.9379255Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:38.9379870Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:38.9380390Z fn() 2025-05-07T20:32:38.9380923Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:38.9381510Z self.fn.run( 2025-05-07T20:32:38.9381979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:38.9382498Z kernel = self.compile( 2025-05-07T20:32:38.9383059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:38.9383706Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:38.9384108Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:38.9384415Z 2025-05-07T20:32:38.9384647Z self = 2025-05-07T20:32:38.9385979Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:38.9387679Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c64782c0>} 2025-05-07T20:32:38.9389028Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:38.9390032Z context = 2025-05-07T20:32:38.9390320Z 2025-05-07T20:32:38.9390494Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:38.9391017Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:38.9391481Z module_map=module_map) 2025-05-07T20:32:38.9391838Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:38.9392203Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:38.9392463Z E ^ 2025-05-07T20:32:38.9392930Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:38.9393378Z 2025-05-07T20:32:38.9393805Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Hypothesis then works through the remaining sampled parameter combinations. Each example re-prints the same test body and fails identically at Triton compile time; only the sampled parameters differ:

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)

With compiled=True the call enters through torch/_dynamo/eval_frame.py:678 (return fn(*args, **kwargs)); with compiled=False it goes straight to fbgemm_gpu/experimental/gen_ai/moe/activation.py:80 in silu_mul_quant. Both paths reach the same _fbgemm_silu_mul_quant[grid]( launch and the same error:

E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
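Every failure above is the same pre-compilation type check: Triton's fp8e4nv is the FP8 E4M3 format (torch.float8_e4m3fn), which Triton's NVIDIA backend only supports on GPUs of compute capability 8.9 or newer (Ada/Hopper). The linux.g5.4xlarge.nvidia.gpu runner carries an A10G at capability 8.6, which only exposes fp8e4b15 and fp8e5, so the kernel can never compile here. A minimal sketch of a capability-based skip guard for such a test, assuming a unittest-style test class (the class name and decorator placement are illustrative, not FBGEMM's actual code):

import unittest
import torch

def supports_fp8e4nv() -> bool:
    # Triton's fp8e4nv (torch.float8_e4m3fn) needs an NVIDIA GPU with
    # compute capability >= 8.9; the A10G on this runner reports (8, 6).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipUnless(supports_fp8e4nv(), "FP8 E4M3 kernels require SM 8.9+")
class ActivationTests(unittest.TestCase):  # hypothetical name
    ...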
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)

The first three of these fail in _fbgemm_silu_mul_quant exactly as above. In the last example, fn() itself returns and the failure moves to the reference path (moe/activation_test.py:126): ref_fn calls triton_quantize_fp8_row (fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370), whose autotuner benchmarks _kernel_quantize_fp8_row and hits the same compile error:

E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
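The reference path makes the shape of the computation explicit: ref_fn evaluates silu(x0) * x1 in fp32 and then row-quantizes the result to FP8. A pure-PyTorch sketch of such a row-wise quantization, consistent with the test's dequantization y_fp8.to(torch.float32) * y_scale[:, None] (the epsilon clamp and the exact scale_ub handling are assumptions, not FBGEMM's implementation):

import torch

E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

def quantize_fp8_row_ref(y: torch.Tensor, scale_ub: torch.Tensor | None = None):
    # One scale per row, chosen so the row's max magnitude maps to the FP8 max.
    row_max = y.abs().amax(dim=-1).float()
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)  # cap outlier rows
    y_scale = row_max.clamp(min=1e-12) / E4M3_MAX
    y_fp8 = (y.float() / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale

Note that the plain .to(torch.float8_e4m3fn) cast is an ordinary PyTorch op rather than a Triton kernel, so a sketch like this would not trip the compile-time check above.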
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)

Both fail at y_fp8, y_scale = fn() (moe/activation_test.py:117), again in the kernel launch at fbgemm_gpu/experimental/gen_ai/moe/activation.py:80:

E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
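Eager and compiled runs differ only by the dynamo frame in the traceback; the op under test is, in effect, the fusion of the two reference steps shown in the test body. A sketch of that composition, reusing quantize_fp8_row_ref from the sketch above (silu_mul_quant_ref is an illustrative name, not the FBGEMM API):

import torch

def silu_mul_quant_ref(
    x0: torch.Tensor, x1: torch.Tensor, scale_ub: torch.Tensor | None = None
):
    # silu(x0) * x1 in fp32, then row-wise FP8 quantization -- the math the
    # fused _fbgemm_silu_mul_quant kernel performs in a single pass.
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    return quantize_fp8_row_ref(y, scale_ub)  # defined in the earlier sketch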
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.9852277Z 2025-05-07T20:32:39.9852700Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.9853206Z 2025-05-07T20:32:39.9853319Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.9853721Z self=, 2025-05-07T20:32:39.9854120Z T=16384, 2025-05-07T20:32:39.9854316Z D=5120, 2025-05-07T20:32:39.9854505Z scale_ub=1200.0, 2025-05-07T20:32:39.9854738Z contiguous=False, 2025-05-07T20:32:39.9854967Z compiled=True, 2025-05-07T20:32:39.9855167Z ) 2025-05-07T20:32:39.9855496Z self = 2025-05-07T20:32:39.9856120Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:39.9856400Z 2025-05-07T20:32:39.9856486Z @given( 2025-05-07T20:32:39.9856723Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:39.9857044Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:39.9857358Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:39.9857683Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:39.9858014Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:39.9858305Z ) 2025-05-07T20:32:39.9858650Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:39.9859110Z def test_silu_mul_quant( 2025-05-07T20:32:39.9859356Z self, 2025-05-07T20:32:39.9859555Z T: int, 2025-05-07T20:32:39.9859757Z D: int, 2025-05-07T20:32:39.9859981Z scale_ub: Optional[float], 2025-05-07T20:32:39.9860250Z contiguous: bool, 2025-05-07T20:32:39.9860509Z compiled: bool, 2025-05-07T20:32:39.9860744Z ) -> None: 2025-05-07T20:32:39.9860973Z torch.manual_seed(2025) 2025-05-07T20:32:39.9861213Z 2025-05-07T20:32:39.9861500Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:39.9861856Z 2025-05-07T20:32:39.9862049Z x_sign = torch.sign(x) 2025-05-07T20:32:39.9862355Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:39.9862677Z x = x_sign * x_clamp 2025-05-07T20:32:39.9862919Z x0 = x[:, :D] 2025-05-07T20:32:39.9863147Z x1 = x[:, D:] 2025-05-07T20:32:39.9863356Z 2025-05-07T20:32:39.9863537Z if contiguous: 2025-05-07T20:32:39.9863767Z x0 = x0.contiguous() 2025-05-07T20:32:39.9864023Z x1 = x1.contiguous() 2025-05-07T20:32:39.9864262Z 2025-05-07T20:32:39.9864454Z if scale_ub is not None: 2025-05-07T20:32:39.9864734Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:39.9865066Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:39.9865374Z ) 2025-05-07T20:32:39.9865569Z else: 2025-05-07T20:32:39.9865866Z scale_ub_tensor = None 2025-05-07T20:32:39.9866115Z 2025-05-07T20:32:39.9866345Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:39.9866665Z op = silu_mul_quant 2025-05-07T20:32:39.9866913Z if compiled: 2025-05-07T20:32:39.9867163Z op = torch.compile(op) 2025-05-07T20:32:39.9867547Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.9867822Z 2025-05-07T20:32:39.9868018Z > y_fp8, y_scale = fn() 2025-05-07T20:32:39.9868184Z 2025-05-07T20:32:39.9868296Z moe/activation_test.py:117: 2025-05-07T20:32:39.9868585Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.9868918Z moe/activation_test.py:115: in fn 2025-05-07T20:32:39.9869201Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:39.9869787Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:39.9870586Z return fn(*args, **kwargs) 
2025-05-07T20:32:39.9871351Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:39.9872043Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:39.9872593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:39.9873282Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:39.9873949Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:39.9874483Z kernel = self.compile( 2025-05-07T20:32:39.9875045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:39.9875815Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:39.9876219Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:39.9876463Z 2025-05-07T20:32:39.9876712Z self = 2025-05-07T20:32:39.9877783Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:39.9879145Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c63c3560>} 2025-05-07T20:32:39.9880478Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:39.9881545Z context = 2025-05-07T20:32:39.9881832Z 2025-05-07T20:32:39.9882002Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:39.9882526Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:39.9883007Z module_map=module_map) 2025-05-07T20:32:39.9883374Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:39.9883728Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:39.9883997Z E ^ 2025-05-07T20:32:39.9884466Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:39.9884916Z 2025-05-07T20:32:39.9885334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:39.9885856Z 2025-05-07T20:32:39.9885961Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:39.9886374Z self=, 2025-05-07T20:32:39.9886919Z T=2048, 2025-05-07T20:32:39.9887106Z D=7168, 2025-05-07T20:32:39.9887306Z scale_ub=1200.0, 2025-05-07T20:32:39.9887534Z contiguous=False, 2025-05-07T20:32:39.9887776Z compiled=True, 2025-05-07T20:32:40.1716122Z ) 2025-05-07T20:32:40.1729439Z self = 2025-05-07T20:32:40.1730200Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:40.1730586Z 2025-05-07T20:32:40.1730718Z @given( 2025-05-07T20:32:40.1731024Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.1731440Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.1731827Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.1732151Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.1732489Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.1732774Z ) 2025-05-07T20:32:40.1733118Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.1733570Z def test_silu_mul_quant( 2025-05-07T20:32:40.1733811Z self, 2025-05-07T20:32:40.1733995Z T: int, 2025-05-07T20:32:40.1734186Z D: int, 2025-05-07T20:32:40.1734401Z scale_ub: Optional[float], 2025-05-07T20:32:40.1734665Z contiguous: bool, 2025-05-07T20:32:40.1734895Z compiled: bool, 2025-05-07T20:32:40.1735117Z ) -> None: 2025-05-07T20:32:40.1735325Z torch.manual_seed(2025) 2025-05-07T20:32:40.1735557Z 2025-05-07T20:32:40.1735826Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.1736159Z 2025-05-07T20:32:40.1736340Z x_sign = torch.sign(x) 2025-05-07T20:32:40.1736644Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.1737158Z x = x_sign * x_clamp 2025-05-07T20:32:40.1737391Z x0 = x[:, :D] 2025-05-07T20:32:40.1737592Z x1 = x[:, D:] 2025-05-07T20:32:40.1737794Z 2025-05-07T20:32:40.1737976Z if contiguous: 2025-05-07T20:32:40.1738198Z x0 = x0.contiguous() 2025-05-07T20:32:40.1738450Z x1 = x1.contiguous() 2025-05-07T20:32:40.1738687Z 2025-05-07T20:32:40.1738866Z if scale_ub is not None: 2025-05-07T20:32:40.1739129Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.1739453Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.1739746Z ) 2025-05-07T20:32:40.1739934Z else: 2025-05-07T20:32:40.1740348Z scale_ub_tensor = None 2025-05-07T20:32:40.1740594Z 2025-05-07T20:32:40.1740812Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.1741114Z op = silu_mul_quant 2025-05-07T20:32:40.1741356Z if compiled: 2025-05-07T20:32:40.1741598Z op = torch.compile(op) 2025-05-07T20:32:40.1741893Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.1742153Z 2025-05-07T20:32:40.1742334Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.1742494Z 2025-05-07T20:32:40.1742589Z moe/activation_test.py:117: 2025-05-07T20:32:40.1742875Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.1743193Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.1743471Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.1744020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.1744572Z return fn(*args, **kwargs) 
2025-05-07T20:32:40.1745256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.1745939Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.1746472Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.1747262Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.1747980Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.1748498Z kernel = self.compile( 2025-05-07T20:32:40.1749043Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.1749679Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.1750075Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.1750297Z 2025-05-07T20:32:40.1750508Z self = 2025-05-07T20:32:40.1751569Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.1753012Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c63428e0>} 2025-05-07T20:32:40.1754363Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.1755360Z context = 2025-05-07T20:32:40.1755647Z 2025-05-07T20:32:40.1755816Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.1756338Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.1756787Z module_map=module_map) 2025-05-07T20:32:40.1757272Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.1757623Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.1757870Z E ^ 2025-05-07T20:32:40.1758332Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.1758774Z 2025-05-07T20:32:40.1759185Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.1759685Z 2025-05-07T20:32:40.1759789Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.1760181Z self=, 2025-05-07T20:32:40.1760571Z T=1, 2025-05-07T20:32:40.1760747Z D=5120, 2025-05-07T20:32:40.1760926Z scale_ub=None, 2025-05-07T20:32:40.1761136Z contiguous=False, 2025-05-07T20:32:40.1761353Z compiled=False, 2025-05-07T20:32:40.1761543Z ) 2025-05-07T20:32:40.1761855Z self = 2025-05-07T20:32:40.1762340Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:40.1762596Z 2025-05-07T20:32:40.1762679Z @given( 2025-05-07T20:32:40.1762896Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.1763200Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.1763495Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.1763812Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.1764125Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.1764398Z ) 2025-05-07T20:32:40.1764738Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.1765215Z def test_silu_mul_quant( 2025-05-07T20:32:40.1765444Z self, 2025-05-07T20:32:40.1765632Z T: int, 2025-05-07T20:32:40.1765822Z D: int, 2025-05-07T20:32:40.1766032Z scale_ub: Optional[float], 2025-05-07T20:32:40.1766299Z contiguous: bool, 2025-05-07T20:32:40.1766532Z compiled: bool, 2025-05-07T20:32:40.1766742Z ) -> None: 2025-05-07T20:32:40.1767034Z torch.manual_seed(2025) 2025-05-07T20:32:40.1767274Z 2025-05-07T20:32:40.1767539Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.1767866Z 2025-05-07T20:32:40.1768056Z x_sign = torch.sign(x) 2025-05-07T20:32:40.1768345Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.1768639Z x = x_sign * x_clamp 2025-05-07T20:32:40.1768869Z x0 = x[:, :D] 2025-05-07T20:32:40.1769082Z x1 = x[:, D:] 2025-05-07T20:32:40.1769274Z 2025-05-07T20:32:40.1769457Z if contiguous: 2025-05-07T20:32:40.1769687Z x0 = x0.contiguous() 2025-05-07T20:32:40.1769939Z x1 = x1.contiguous() 2025-05-07T20:32:40.1770177Z 2025-05-07T20:32:40.1770363Z if scale_ub is not None: 2025-05-07T20:32:40.1770631Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.1770956Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.1771251Z ) 2025-05-07T20:32:40.1771444Z else: 2025-05-07T20:32:40.1771652Z scale_ub_tensor = None 2025-05-07T20:32:40.1771895Z 2025-05-07T20:32:40.1772123Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.1772427Z op = silu_mul_quant 2025-05-07T20:32:40.1772674Z if compiled: 2025-05-07T20:32:40.1772916Z op = torch.compile(op) 2025-05-07T20:32:40.1773197Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.1773469Z 2025-05-07T20:32:40.1773655Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.1773814Z 2025-05-07T20:32:40.1773912Z moe/activation_test.py:117: 2025-05-07T20:32:40.1774210Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.1774529Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.1774887Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.1775565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.1776239Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.1776812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.1777481Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.1778131Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.1778649Z kernel = self.compile( 2025-05-07T20:32:40.1779182Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.1779816Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.1780202Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.1780422Z 2025-05-07T20:32:40.1780633Z self = 2025-05-07T20:32:40.1781685Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.1783034Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c63434c0>} 2025-05-07T20:32:40.1784341Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.1785337Z context = 2025-05-07T20:32:40.1785626Z 2025-05-07T20:32:40.1785796Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.1786405Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.1786876Z module_map=module_map) 2025-05-07T20:32:40.1787235Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.1787627Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.1787878Z E ^ 2025-05-07T20:32:40.1788328Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.1788764Z 2025-05-07T20:32:40.1789187Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.1789687Z 2025-05-07T20:32:40.1789786Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.1790188Z self=, 2025-05-07T20:32:40.1790585Z T=4096, 2025-05-07T20:32:40.1790773Z D=7168, 2025-05-07T20:32:40.1790956Z scale_ub=1200.0, 2025-05-07T20:32:40.1791184Z contiguous=False, 2025-05-07T20:32:40.1791402Z compiled=False, 2025-05-07T20:32:40.1791601Z ) 2025-05-07T20:32:40.1791911Z self = 2025-05-07T20:32:40.1792396Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:40.1792671Z 2025-05-07T20:32:40.1792751Z @given( 2025-05-07T20:32:40.1792976Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.1793284Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.1793576Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.1793894Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.1794210Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.1794484Z ) 2025-05-07T20:32:40.1794914Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.1795339Z def test_silu_mul_quant( 2025-05-07T20:32:40.1795576Z self, 2025-05-07T20:32:40.1795760Z T: int, 2025-05-07T20:32:40.1795949Z D: int, 2025-05-07T20:32:40.1796161Z scale_ub: Optional[float], 2025-05-07T20:32:40.1796419Z contiguous: bool, 2025-05-07T20:32:40.1796652Z compiled: bool, 2025-05-07T20:32:40.1796866Z ) -> None: 2025-05-07T20:32:40.1797073Z torch.manual_seed(2025) 2025-05-07T20:32:40.1797307Z 2025-05-07T20:32:40.1797574Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.1797906Z 2025-05-07T20:32:40.1798094Z x_sign = torch.sign(x) 2025-05-07T20:32:40.1798377Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.1798671Z x = x_sign * x_clamp 2025-05-07T20:32:40.1798904Z x0 = x[:, :D] 2025-05-07T20:32:40.1799127Z x1 = x[:, D:] 2025-05-07T20:32:40.1799321Z 2025-05-07T20:32:40.1799496Z if contiguous: 2025-05-07T20:32:40.1799717Z x0 = x0.contiguous() 2025-05-07T20:32:40.1799964Z x1 = x1.contiguous() 2025-05-07T20:32:40.1800198Z 2025-05-07T20:32:40.1800386Z if scale_ub is not None: 2025-05-07T20:32:40.1800650Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.1800971Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.1801268Z ) 2025-05-07T20:32:40.1801458Z else: 2025-05-07T20:32:40.1801656Z scale_ub_tensor = None 2025-05-07T20:32:40.1801901Z 2025-05-07T20:32:40.1802125Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.1802422Z op = silu_mul_quant 2025-05-07T20:32:40.1802662Z if compiled: 2025-05-07T20:32:40.1802897Z op = torch.compile(op) 2025-05-07T20:32:40.1803175Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.1803448Z 2025-05-07T20:32:40.1803636Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.1803792Z 2025-05-07T20:32:40.1803885Z moe/activation_test.py:117: 2025-05-07T20:32:40.1804293Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.1804618Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.1804889Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.1805563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:40.1806242Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.1806821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.1807486Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.1808142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.1808669Z kernel = self.compile( 2025-05-07T20:32:40.1809210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.1809846Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.1810228Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.1810448Z 2025-05-07T20:32:40.1810656Z self = 2025-05-07T20:32:40.1811710Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.1813047Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c53ad080>} 2025-05-07T20:32:40.1814443Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.1815452Z context = 2025-05-07T20:32:40.1815734Z 2025-05-07T20:32:40.1815906Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.1816412Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.1816927Z module_map=module_map) 2025-05-07T20:32:40.1817282Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.1817635Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.1817884Z E ^ 2025-05-07T20:32:40.1818339Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.1818782Z 2025-05-07T20:32:40.1819204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.3346876Z 2025-05-07T20:32:40.3347219Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.3347923Z self=, 2025-05-07T20:32:40.3348471Z T=16384, 2025-05-07T20:32:40.3348771Z D=7168, 2025-05-07T20:32:40.3348987Z scale_ub=None, 2025-05-07T20:32:40.3349202Z contiguous=True, 2025-05-07T20:32:40.3349417Z compiled=True, 2025-05-07T20:32:40.3349623Z ) 2025-05-07T20:32:40.3349940Z self = 2025-05-07T20:32:40.3350424Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:40.3350695Z 2025-05-07T20:32:40.3350775Z @given( 2025-05-07T20:32:40.3351003Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.3351321Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.3351627Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.3352127Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.3352454Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.3352733Z ) 2025-05-07T20:32:40.3353076Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.3353538Z def test_silu_mul_quant( 2025-05-07T20:32:40.3353771Z self, 2025-05-07T20:32:40.3353966Z T: int, 2025-05-07T20:32:40.3354166Z D: int, 2025-05-07T20:32:40.3354380Z scale_ub: Optional[float], 2025-05-07T20:32:40.3354650Z contiguous: bool, 2025-05-07T20:32:40.3354886Z compiled: bool, 2025-05-07T20:32:40.3355105Z ) -> None: 2025-05-07T20:32:40.3355316Z torch.manual_seed(2025) 2025-05-07T20:32:40.3355559Z 2025-05-07T20:32:40.3355823Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.3356167Z 2025-05-07T20:32:40.3356362Z x_sign = torch.sign(x) 2025-05-07T20:32:40.3356652Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.3356953Z x = x_sign * x_clamp 2025-05-07T20:32:40.3357193Z x0 = x[:, :D] 2025-05-07T20:32:40.3357407Z x1 = x[:, D:] 2025-05-07T20:32:40.3357614Z 2025-05-07T20:32:40.3357872Z if contiguous: 2025-05-07T20:32:40.3358136Z x0 = x0.contiguous() 2025-05-07T20:32:40.3358389Z x1 = x1.contiguous() 2025-05-07T20:32:40.3358626Z 2025-05-07T20:32:40.3358817Z if scale_ub is not None: 2025-05-07T20:32:40.3359082Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.3359413Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.3359724Z ) 2025-05-07T20:32:40.3359913Z else: 2025-05-07T20:32:40.3360126Z scale_ub_tensor = None 2025-05-07T20:32:40.3360382Z 2025-05-07T20:32:40.3360738Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.3361047Z op = silu_mul_quant 2025-05-07T20:32:40.3361305Z if compiled: 2025-05-07T20:32:40.3361545Z op = torch.compile(op) 2025-05-07T20:32:40.3361839Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.3362110Z 2025-05-07T20:32:40.3362298Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.3362464Z 2025-05-07T20:32:40.3362562Z moe/activation_test.py:117: 2025-05-07T20:32:40.3362856Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.3363187Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.3363455Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.3364015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.3364567Z return fn(*args, **kwargs) 
2025-05-07T20:32:40.3365226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.3365902Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.3366623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.3367306Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.3368102Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.3368644Z kernel = self.compile( 2025-05-07T20:32:40.3369188Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.3369830Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.3370212Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.3370443Z 2025-05-07T20:32:40.3370655Z self = 2025-05-07T20:32:40.3371814Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.3373192Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c53ae2a0>} 2025-05-07T20:32:40.3374509Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.3375514Z context = 2025-05-07T20:32:40.3375805Z 2025-05-07T20:32:40.3375969Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.3376500Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.3376964Z module_map=module_map) 2025-05-07T20:32:40.3377325Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.3377682Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.3377936Z E ^ 2025-05-07T20:32:40.3378390Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.3378833Z 2025-05-07T20:32:40.3379247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.3379751Z 2025-05-07T20:32:40.3379854Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.3380254Z self=, 2025-05-07T20:32:40.3380650Z T=4096, 2025-05-07T20:32:40.3380836Z D=5120, 2025-05-07T20:32:40.3381108Z scale_ub=None, 2025-05-07T20:32:40.3381314Z contiguous=False, 2025-05-07T20:32:40.3381536Z compiled=True, 2025-05-07T20:32:40.3381745Z ) 2025-05-07T20:32:40.3382065Z self = 2025-05-07T20:32:40.3382556Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:40.3382832Z 2025-05-07T20:32:40.3382918Z @given( 2025-05-07T20:32:40.3383145Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.3383464Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.3383768Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.3384089Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.3384413Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.3384694Z ) 2025-05-07T20:32:40.3385045Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.3385498Z def test_silu_mul_quant( 2025-05-07T20:32:40.3385738Z self, 2025-05-07T20:32:40.3385936Z T: int, 2025-05-07T20:32:40.3386124Z D: int, 2025-05-07T20:32:40.3386347Z scale_ub: Optional[float], 2025-05-07T20:32:40.3386640Z contiguous: bool, 2025-05-07T20:32:40.3386894Z compiled: bool, 2025-05-07T20:32:40.3387118Z ) -> None: 2025-05-07T20:32:40.3387332Z torch.manual_seed(2025) 2025-05-07T20:32:40.3387623Z 2025-05-07T20:32:40.3387897Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.3388228Z 2025-05-07T20:32:40.3388414Z x_sign = torch.sign(x) 2025-05-07T20:32:40.3388700Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.3389002Z x = x_sign * x_clamp 2025-05-07T20:32:40.3389234Z x0 = x[:, :D] 2025-05-07T20:32:40.3389451Z x1 = x[:, D:] 2025-05-07T20:32:40.3389658Z 2025-05-07T20:32:40.3389847Z if contiguous: 2025-05-07T20:32:40.3390076Z x0 = x0.contiguous() 2025-05-07T20:32:40.3390331Z x1 = x1.contiguous() 2025-05-07T20:32:40.3390567Z 2025-05-07T20:32:40.3390838Z if scale_ub is not None: 2025-05-07T20:32:40.3391110Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.3391438Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.3391735Z ) 2025-05-07T20:32:40.3391925Z else: 2025-05-07T20:32:40.3392135Z scale_ub_tensor = None 2025-05-07T20:32:40.3392381Z 2025-05-07T20:32:40.3392610Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.3392922Z op = silu_mul_quant 2025-05-07T20:32:40.3393167Z if compiled: 2025-05-07T20:32:40.3393410Z op = torch.compile(op) 2025-05-07T20:32:40.3393699Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.3393961Z 2025-05-07T20:32:40.3394162Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.3394327Z 2025-05-07T20:32:40.3394427Z moe/activation_test.py:117: 2025-05-07T20:32:40.3394721Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.3395056Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.3402447Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.3403023Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.3403604Z return fn(*args, **kwargs) 
2025-05-07T20:32:40.3404255Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.3404932Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.3405457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.3406128Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.3406795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.3407438Z kernel = self.compile( 2025-05-07T20:32:40.3407981Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.3408618Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.3409007Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.3409238Z 2025-05-07T20:32:40.3409440Z self = 2025-05-07T20:32:40.3410513Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.3411864Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c53aefc0>} 2025-05-07T20:32:40.3413180Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.3414186Z context = 2025-05-07T20:32:40.3414471Z 2025-05-07T20:32:40.3414633Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.3415154Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.3415607Z module_map=module_map) 2025-05-07T20:32:40.3415962Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.3416305Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.3416546Z E ^ 2025-05-07T20:32:40.3416993Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.3417443Z 2025-05-07T20:32:40.3417947Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.4775493Z 2025-05-07T20:32:40.4775818Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.4776429Z self=, 2025-05-07T20:32:40.4777019Z T=4096, 2025-05-07T20:32:40.4777260Z D=5120, 2025-05-07T20:32:40.4777583Z scale_ub=1200.0, 2025-05-07T20:32:40.4777884Z contiguous=False, 2025-05-07T20:32:40.4778107Z compiled=False, 2025-05-07T20:32:40.4778310Z ) 2025-05-07T20:32:40.4778614Z self = 2025-05-07T20:32:40.4779108Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:40.4779388Z 2025-05-07T20:32:40.4779465Z @given( 2025-05-07T20:32:40.4779704Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.4780003Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.4780306Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.4780630Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.4780943Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.4781224Z ) 2025-05-07T20:32:40.4781567Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.4781994Z def test_silu_mul_quant( 2025-05-07T20:32:40.4782229Z self, 2025-05-07T20:32:40.4782420Z T: int, 2025-05-07T20:32:40.4782612Z D: int, 2025-05-07T20:32:40.4782828Z scale_ub: Optional[float], 2025-05-07T20:32:40.4783097Z contiguous: bool, 2025-05-07T20:32:40.4783324Z compiled: bool, 2025-05-07T20:32:40.4783545Z ) -> None: 2025-05-07T20:32:40.4783755Z torch.manual_seed(2025) 2025-05-07T20:32:40.4783989Z 2025-05-07T20:32:40.4784425Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.4784754Z 2025-05-07T20:32:40.4784944Z x_sign = torch.sign(x) 2025-05-07T20:32:40.4785218Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.4785518Z x = x_sign * x_clamp 2025-05-07T20:32:40.4785757Z x0 = x[:, :D] 2025-05-07T20:32:40.4785962Z x1 = x[:, D:] 2025-05-07T20:32:40.4786158Z 2025-05-07T20:32:40.4786331Z if contiguous: 2025-05-07T20:32:40.4786550Z x0 = x0.contiguous() 2025-05-07T20:32:40.4786825Z x1 = x1.contiguous() 2025-05-07T20:32:40.4787080Z 2025-05-07T20:32:40.4787258Z if scale_ub is not None: 2025-05-07T20:32:40.4787622Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.4787948Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.4788243Z ) 2025-05-07T20:32:40.4788427Z else: 2025-05-07T20:32:40.4788632Z scale_ub_tensor = None 2025-05-07T20:32:40.4788881Z 2025-05-07T20:32:40.4789102Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.4789409Z op = silu_mul_quant 2025-05-07T20:32:40.4789648Z if compiled: 2025-05-07T20:32:40.4789880Z op = torch.compile(op) 2025-05-07T20:32:40.4790163Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.4790422Z 2025-05-07T20:32:40.4790599Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.4790763Z 2025-05-07T20:32:40.4790857Z moe/activation_test.py:117: 2025-05-07T20:32:40.4791143Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.4791458Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.4791723Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.4792411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:40.4793097Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.4793619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.4794413Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.4795069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.4795587Z kernel = self.compile( 2025-05-07T20:32:40.4796124Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.4796863Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.4797436Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.4797726Z 2025-05-07T20:32:40.4797929Z self = 2025-05-07T20:32:40.4799010Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.4800381Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4ca8360>} 2025-05-07T20:32:40.4801733Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.4802741Z context = 2025-05-07T20:32:40.4803025Z 2025-05-07T20:32:40.4803190Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.4803703Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.4804273Z module_map=module_map) 2025-05-07T20:32:40.4804622Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.4804970Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.4805219Z E ^ 2025-05-07T20:32:40.4805673Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.4806109Z 2025-05-07T20:32:40.4806539Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.4807045Z 2025-05-07T20:32:40.4807144Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.4807647Z self=, 2025-05-07T20:32:40.4808143Z T=4096, 2025-05-07T20:32:40.4808328Z D=5120, 2025-05-07T20:32:40.4808525Z scale_ub=1200.0, 2025-05-07T20:32:40.4808747Z contiguous=False, 2025-05-07T20:32:40.4808974Z compiled=True, 2025-05-07T20:32:40.4809178Z ) 2025-05-07T20:32:40.4809498Z self = 2025-05-07T20:32:40.4809978Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:40.4810251Z 2025-05-07T20:32:40.4810328Z @given( 2025-05-07T20:32:40.4810550Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.4810850Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.4811154Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.4811477Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.4811800Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.4812075Z ) 2025-05-07T20:32:40.4812421Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.4812876Z def test_silu_mul_quant( 2025-05-07T20:32:40.4813116Z self, 2025-05-07T20:32:40.4813311Z T: int, 2025-05-07T20:32:40.4813509Z D: int, 2025-05-07T20:32:40.4813718Z scale_ub: Optional[float], 2025-05-07T20:32:40.4813985Z contiguous: bool, 2025-05-07T20:32:40.4814313Z compiled: bool, 2025-05-07T20:32:40.4814528Z ) -> None: 2025-05-07T20:32:40.4814755Z torch.manual_seed(2025) 2025-05-07T20:32:40.4814991Z 2025-05-07T20:32:40.4815250Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.4815590Z 2025-05-07T20:32:40.4815781Z x_sign = torch.sign(x) 2025-05-07T20:32:40.4816060Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.4816367Z x = x_sign * x_clamp 2025-05-07T20:32:40.4816609Z x0 = x[:, :D] 2025-05-07T20:32:40.4816825Z x1 = x[:, D:] 2025-05-07T20:32:40.4817021Z 2025-05-07T20:32:40.4817203Z if contiguous: 2025-05-07T20:32:40.4817439Z x0 = x0.contiguous() 2025-05-07T20:32:40.4817686Z x1 = x1.contiguous() 2025-05-07T20:32:40.4817935Z 2025-05-07T20:32:40.4818122Z if scale_ub is not None: 2025-05-07T20:32:40.4818388Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.4818717Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.4819026Z ) 2025-05-07T20:32:40.4819211Z else: 2025-05-07T20:32:40.4819416Z scale_ub_tensor = None 2025-05-07T20:32:40.4819660Z 2025-05-07T20:32:40.4819881Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.4820188Z op = silu_mul_quant 2025-05-07T20:32:40.4820428Z if compiled: 2025-05-07T20:32:40.4820665Z op = torch.compile(op) 2025-05-07T20:32:40.4820959Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.4821233Z 2025-05-07T20:32:40.4821418Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.4821585Z 2025-05-07T20:32:40.4821680Z moe/activation_test.py:117: 2025-05-07T20:32:40.4821963Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.4822381Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.4822653Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.4823219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.4823762Z return fn(*args, **kwargs) 
2025-05-07T20:32:40.4824408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.4825093Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.4825628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.4826288Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.4826943Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.4827532Z kernel = self.compile( 2025-05-07T20:32:40.4828094Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.4828742Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.4829127Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.4829356Z 2025-05-07T20:32:40.4829562Z self = 2025-05-07T20:32:40.4830617Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.4831977Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4ca94e0>} 2025-05-07T20:32:40.4833408Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.4834423Z context = 2025-05-07T20:32:40.4834702Z 2025-05-07T20:32:40.4834870Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.4835379Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.4835849Z module_map=module_map) 2025-05-07T20:32:40.4836205Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.4836561Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.4836859Z E ^ 2025-05-07T20:32:40.4837314Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.4837753Z 2025-05-07T20:32:40.4838182Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.4838685Z 2025-05-07T20:32:40.4838792Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.4839201Z self=, 2025-05-07T20:32:40.4839589Z T=2048, 2025-05-07T20:32:40.4839763Z D=7168, 2025-05-07T20:32:40.4839942Z scale_ub=1200.0, 2025-05-07T20:32:40.4840599Z contiguous=False, 2025-05-07T20:32:40.4840875Z compiled=False, 2025-05-07T20:32:40.6784083Z ) 2025-05-07T20:32:40.6784743Z self = 2025-05-07T20:32:40.6785485Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:40.6785867Z 2025-05-07T20:32:40.6785970Z @given( 2025-05-07T20:32:40.6786282Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.6786711Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.6787245Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.6787674Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.6787993Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.6788279Z ) 2025-05-07T20:32:40.6788630Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.6789092Z def test_silu_mul_quant( 2025-05-07T20:32:40.6789322Z self, 2025-05-07T20:32:40.6789515Z T: int, 2025-05-07T20:32:40.6789706Z D: int, 2025-05-07T20:32:40.6789913Z scale_ub: Optional[float], 2025-05-07T20:32:40.6790174Z contiguous: bool, 2025-05-07T20:32:40.6790405Z compiled: bool, 2025-05-07T20:32:40.6790619Z ) -> None: 2025-05-07T20:32:40.6790830Z torch.manual_seed(2025) 2025-05-07T20:32:40.6791071Z 2025-05-07T20:32:40.6791339Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.6791682Z 2025-05-07T20:32:40.6791866Z x_sign = torch.sign(x) 2025-05-07T20:32:40.6792143Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.6792450Z x = x_sign * x_clamp 2025-05-07T20:32:40.6792681Z x0 = x[:, :D] 2025-05-07T20:32:40.6792886Z x1 = x[:, D:] 2025-05-07T20:32:40.6793083Z 2025-05-07T20:32:40.6793268Z if contiguous: 2025-05-07T20:32:40.6793496Z x0 = x0.contiguous() 2025-05-07T20:32:40.6793741Z x1 = x1.contiguous() 2025-05-07T20:32:40.6793978Z 2025-05-07T20:32:40.6794160Z if scale_ub is not None: 2025-05-07T20:32:40.6794421Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.6794749Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.6795042Z ) 2025-05-07T20:32:40.6795219Z else: 2025-05-07T20:32:40.6795423Z scale_ub_tensor = None 2025-05-07T20:32:40.6795667Z 2025-05-07T20:32:40.6795885Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.6796195Z op = silu_mul_quant 2025-05-07T20:32:40.6796442Z if compiled: 2025-05-07T20:32:40.6796839Z op = torch.compile(op) 2025-05-07T20:32:40.6797138Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.6797403Z 2025-05-07T20:32:40.6797585Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.6797748Z 2025-05-07T20:32:40.6797843Z moe/activation_test.py:117: 2025-05-07T20:32:40.6798125Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6798443Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.6798705Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.6799381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:40.6800057Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.6800580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.6801244Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.6801899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.6802415Z kernel = self.compile( 2025-05-07T20:32:40.6802957Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.6803591Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.6803980Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6804202Z 2025-05-07T20:32:40.6804404Z self = 2025-05-07T20:32:40.6805457Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.6806945Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4ca9f80>} 2025-05-07T20:32:40.6808296Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.6809289Z context = 2025-05-07T20:32:40.6809569Z 2025-05-07T20:32:40.6809734Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.6810239Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.6810691Z module_map=module_map) 2025-05-07T20:32:40.6811044Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.6811393Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.6811644Z E ^ 2025-05-07T20:32:40.6812099Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.6812537Z 2025-05-07T20:32:40.6812955Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.6813454Z 2025-05-07T20:32:40.6813554Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.6813952Z self=, 2025-05-07T20:32:40.6814352Z T=1, 2025-05-07T20:32:40.6814529Z D=7168, 2025-05-07T20:32:40.6814706Z scale_ub=None, 2025-05-07T20:32:40.6814914Z contiguous=True, 2025-05-07T20:32:40.6815128Z compiled=False, 2025-05-07T20:32:40.6815318Z ) 2025-05-07T20:32:40.6815625Z self = 2025-05-07T20:32:40.6816098Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:40.6816356Z 2025-05-07T20:32:40.6816515Z @given( 2025-05-07T20:32:40.6816735Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.6817037Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.6817327Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.6817643Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.6817961Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.6818233Z ) 2025-05-07T20:32:40.6818565Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.6818996Z def test_silu_mul_quant( 2025-05-07T20:32:40.6819226Z self, 2025-05-07T20:32:40.6819406Z T: int, 2025-05-07T20:32:40.6819598Z D: int, 2025-05-07T20:32:40.6819803Z scale_ub: Optional[float], 2025-05-07T20:32:40.6820058Z contiguous: bool, 2025-05-07T20:32:40.6820291Z compiled: bool, 2025-05-07T20:32:40.6820502Z ) -> None: 2025-05-07T20:32:40.6820698Z torch.manual_seed(2025) 2025-05-07T20:32:40.6820934Z 2025-05-07T20:32:40.6821200Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.6821518Z 2025-05-07T20:32:40.6821696Z x_sign = torch.sign(x) 2025-05-07T20:32:40.6821978Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.6822270Z x = x_sign * x_clamp 2025-05-07T20:32:40.6822499Z x0 = x[:, :D] 2025-05-07T20:32:40.6822700Z x1 = x[:, D:] 2025-05-07T20:32:40.6822903Z 2025-05-07T20:32:40.6823068Z if contiguous: 2025-05-07T20:32:40.6823290Z x0 = x0.contiguous() 2025-05-07T20:32:40.6823543Z x1 = x1.contiguous() 2025-05-07T20:32:40.6823775Z 2025-05-07T20:32:40.6823956Z if scale_ub is not None: 2025-05-07T20:32:40.6824215Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.6824623Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.6824920Z ) 2025-05-07T20:32:40.6825099Z else: 2025-05-07T20:32:40.6825298Z scale_ub_tensor = None 2025-05-07T20:32:40.6825541Z 2025-05-07T20:32:40.6825763Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.6826057Z op = silu_mul_quant 2025-05-07T20:32:40.6826299Z if compiled: 2025-05-07T20:32:40.6826539Z op = torch.compile(op) 2025-05-07T20:32:40.6826819Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.6827088Z 2025-05-07T20:32:40.6827279Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.6827486Z 2025-05-07T20:32:40.6827589Z moe/activation_test.py:117: 2025-05-07T20:32:40.6827873Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.6828190Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.6828466Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.6829144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.6829811Z 
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)

T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
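The root cause is the hardware rather than the test logic: the linux.g5.4xlarge.nvidia.gpu runner carries an NVIDIA A10G (compute capability sm_86), and Triton does not lower the fp8e4nv (FP8 E4M3) dtype on this architecture, so every Hypothesis example dies at kernel-compile time before any numerics run. A minimal sketch of a capability guard that would skip these cases up front; the helper name supports_fp8e4nv and the (8, 9) threshold are assumptions for illustration, not FBGEMM code:

    import unittest

    import torch


    def supports_fp8e4nv() -> bool:
        # Assumption: Triton's fp8e4nv lowering needs compute capability 8.9+
        # (Ada/Hopper). The A10G on this runner reports (8, 6) and is skipped.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    @unittest.skipUnless(supports_fp8e4nv(), "Triton fp8e4nv unsupported on this GPU")
    class ActivationFp8Tests(unittest.TestCase):
        ...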
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.6881496Z 2025-05-07T20:32:40.6881913Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.8166300Z 2025-05-07T20:32:40.8166664Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.8167084Z self=, 2025-05-07T20:32:40.8167577Z T=1, 2025-05-07T20:32:40.8167830Z D=7168, 2025-05-07T20:32:40.8168092Z scale_ub=None, 2025-05-07T20:32:40.8168377Z contiguous=False, 2025-05-07T20:32:40.8168687Z compiled=False, 2025-05-07T20:32:40.8168966Z ) 2025-05-07T20:32:40.8169305Z self = 2025-05-07T20:32:40.8169799Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:40.8170065Z 2025-05-07T20:32:40.8170143Z @given( 2025-05-07T20:32:40.8170362Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.8170684Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.8170985Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.8171321Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.8171652Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.8171924Z ) 2025-05-07T20:32:40.8172263Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.8172705Z def test_silu_mul_quant( 2025-05-07T20:32:40.8172942Z self, 2025-05-07T20:32:40.8173128Z T: int, 2025-05-07T20:32:40.8173314Z D: int, 2025-05-07T20:32:40.8173523Z scale_ub: Optional[float], 2025-05-07T20:32:40.8173789Z contiguous: bool, 2025-05-07T20:32:40.8174026Z compiled: bool, 2025-05-07T20:32:40.8174240Z ) -> None: 2025-05-07T20:32:40.8174628Z torch.manual_seed(2025) 2025-05-07T20:32:40.8174861Z 2025-05-07T20:32:40.8175125Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.8175468Z 2025-05-07T20:32:40.8175649Z x_sign = torch.sign(x) 2025-05-07T20:32:40.8175942Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.8176246Z x = x_sign * x_clamp 2025-05-07T20:32:40.8176478Z x0 = x[:, :D] 2025-05-07T20:32:40.8176678Z x1 = x[:, D:] 2025-05-07T20:32:40.8176903Z 2025-05-07T20:32:40.8177104Z if contiguous: 2025-05-07T20:32:40.8177336Z x0 = x0.contiguous() 2025-05-07T20:32:40.8177590Z x1 = x1.contiguous() 2025-05-07T20:32:40.8177825Z 2025-05-07T20:32:40.8178010Z if scale_ub is not None: 2025-05-07T20:32:40.8178283Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.8178613Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.8178911Z ) 2025-05-07T20:32:40.8179093Z else: 2025-05-07T20:32:40.8179298Z scale_ub_tensor = None 2025-05-07T20:32:40.8179535Z 2025-05-07T20:32:40.8179765Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.8180070Z op = silu_mul_quant 2025-05-07T20:32:40.8180302Z if compiled: 2025-05-07T20:32:40.8180536Z op = torch.compile(op) 2025-05-07T20:32:40.8180820Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.8181083Z 2025-05-07T20:32:40.8181263Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.8181427Z 2025-05-07T20:32:40.8181522Z moe/activation_test.py:117: 2025-05-07T20:32:40.8181807Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.8182126Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.8182396Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.8183077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.8183755Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.8184392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.8185064Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.8185730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.8186239Z kernel = self.compile( 2025-05-07T20:32:40.8186782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.8187424Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.8187881Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.8188099Z 2025-05-07T20:32:40.8188299Z self = 2025-05-07T20:32:40.8189365Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.8190714Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4bb8fe0>} 2025-05-07T20:32:40.8192026Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.8193016Z context = 2025-05-07T20:32:40.8193297Z 2025-05-07T20:32:40.8193457Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.8193967Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.8194519Z module_map=module_map) 2025-05-07T20:32:40.8194873Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.8195226Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.8195478Z E ^ 2025-05-07T20:32:40.8195921Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.8196362Z 2025-05-07T20:32:40.8196766Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.8197267Z 2025-05-07T20:32:40.8197366Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.8197768Z self=, 2025-05-07T20:32:40.8198160Z T=2048, 2025-05-07T20:32:40.8198340Z D=7168, 2025-05-07T20:32:40.8198526Z scale_ub=None, 2025-05-07T20:32:40.8198737Z contiguous=False, 2025-05-07T20:32:40.8198956Z compiled=True, 2025-05-07T20:32:40.8199152Z ) 2025-05-07T20:32:40.8199460Z self = 2025-05-07T20:32:40.8199938Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:40.8200211Z 2025-05-07T20:32:40.8200284Z @given( 2025-05-07T20:32:40.8200503Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:40.8200799Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:40.8201094Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:40.8201412Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:40.8201723Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:40.8201999Z ) 2025-05-07T20:32:40.8202331Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:40.8202776Z def test_silu_mul_quant( 2025-05-07T20:32:40.8203009Z self, 2025-05-07T20:32:40.8203195Z T: int, 2025-05-07T20:32:40.8203384Z D: int, 2025-05-07T20:32:40.8203587Z scale_ub: Optional[float], 2025-05-07T20:32:40.8203933Z contiguous: bool, 2025-05-07T20:32:40.8204167Z compiled: bool, 2025-05-07T20:32:40.8204379Z ) -> None: 2025-05-07T20:32:40.8204590Z torch.manual_seed(2025) 2025-05-07T20:32:40.8204822Z 2025-05-07T20:32:40.8205079Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:40.8205410Z 2025-05-07T20:32:40.8205595Z x_sign = torch.sign(x) 2025-05-07T20:32:40.8205867Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:40.8206162Z x = x_sign * x_clamp 2025-05-07T20:32:40.8206391Z x0 = x[:, :D] 2025-05-07T20:32:40.8206595Z x1 = x[:, D:] 2025-05-07T20:32:40.8206799Z 2025-05-07T20:32:40.8206973Z if contiguous: 2025-05-07T20:32:40.8207190Z x0 = x0.contiguous() 2025-05-07T20:32:40.8207443Z x1 = x1.contiguous() 2025-05-07T20:32:40.8207674Z 2025-05-07T20:32:40.8207853Z if scale_ub is not None: 2025-05-07T20:32:40.8208126Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:40.8208453Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:40.8208743Z ) 2025-05-07T20:32:40.8208934Z else: 2025-05-07T20:32:40.8209140Z scale_ub_tensor = None 2025-05-07T20:32:40.8209379Z 2025-05-07T20:32:40.8209600Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:40.8209902Z op = silu_mul_quant 2025-05-07T20:32:40.8210148Z if compiled: 2025-05-07T20:32:40.8210381Z op = torch.compile(op) 2025-05-07T20:32:40.8210671Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.8210945Z 2025-05-07T20:32:40.8211121Z > y_fp8, y_scale = fn() 2025-05-07T20:32:40.8211280Z 2025-05-07T20:32:40.8211372Z moe/activation_test.py:117: 2025-05-07T20:32:40.8211775Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.8212094Z moe/activation_test.py:115: in fn 2025-05-07T20:32:40.8212368Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:40.8212914Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:40.8213457Z return fn(*args, **kwargs) 
2025-05-07T20:32:40.8214095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:40.8214766Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:40.8215296Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:40.8215957Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:40.8216607Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:40.8217188Z kernel = self.compile( 2025-05-07T20:32:40.8217726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:40.8218357Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:40.8218737Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:40.8218958Z 2025-05-07T20:32:40.8219159Z self = 2025-05-07T20:32:40.8220213Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:40.8221548Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4bba7a0>} 2025-05-07T20:32:40.8222993Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:40.8224001Z context = 2025-05-07T20:32:40.8224288Z 2025-05-07T20:32:40.8224450Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:40.8224961Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:40.8225423Z module_map=module_map) 2025-05-07T20:32:40.8225783Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:40.8226131Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:40.8226377Z E ^ 2025-05-07T20:32:40.8226866Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:40.8227320Z 2025-05-07T20:32:40.8227786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:40.8228286Z 2025-05-07T20:32:40.8228388Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:40.8228786Z self=, 2025-05-07T20:32:40.8229173Z T=4096, 2025-05-07T20:32:40.8229348Z D=7168, 2025-05-07T20:32:40.8229524Z scale_ub=None, 2025-05-07T20:32:40.8229727Z contiguous=False, 2025-05-07T20:32:40.8229942Z compiled=True, 2025-05-07T20:32:41.0462850Z ) 2025-05-07T20:32:41.0463527Z self = 2025-05-07T20:32:41.0464244Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:41.0464609Z 2025-05-07T20:32:41.0464724Z @given( 2025-05-07T20:32:41.0465043Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.0465765Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.0466166Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.0466618Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.0467096Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.0467515Z ) 2025-05-07T20:32:41.0467910Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.0468362Z def test_silu_mul_quant( 2025-05-07T20:32:41.0468759Z self, 2025-05-07T20:32:41.0468991Z T: int, 2025-05-07T20:32:41.0469182Z D: int, 2025-05-07T20:32:41.0469391Z scale_ub: Optional[float], 2025-05-07T20:32:41.0469662Z contiguous: bool, 2025-05-07T20:32:41.0469891Z compiled: bool, 2025-05-07T20:32:41.0470109Z ) -> None: 2025-05-07T20:32:41.0470330Z torch.manual_seed(2025) 2025-05-07T20:32:41.0470563Z 2025-05-07T20:32:41.0470826Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.0471166Z 2025-05-07T20:32:41.0471353Z x_sign = torch.sign(x) 2025-05-07T20:32:41.0471635Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.0471952Z x = x_sign * x_clamp 2025-05-07T20:32:41.0472192Z x0 = x[:, :D] 2025-05-07T20:32:41.0472402Z x1 = x[:, D:] 2025-05-07T20:32:41.0472616Z 2025-05-07T20:32:41.0472793Z if contiguous: 2025-05-07T20:32:41.0473016Z x0 = x0.contiguous() 2025-05-07T20:32:41.0473274Z x1 = x1.contiguous() 2025-05-07T20:32:41.0473515Z 2025-05-07T20:32:41.0473693Z if scale_ub is not None: 2025-05-07T20:32:41.0473956Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.0474290Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.0474595Z ) 2025-05-07T20:32:41.0474781Z else: 2025-05-07T20:32:41.0474986Z scale_ub_tensor = None 2025-05-07T20:32:41.0475231Z 2025-05-07T20:32:41.0475451Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.0475755Z op = silu_mul_quant 2025-05-07T20:32:41.0476139Z if compiled: 2025-05-07T20:32:41.0476382Z op = torch.compile(op) 2025-05-07T20:32:41.0476680Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.0476950Z 2025-05-07T20:32:41.0477129Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.0477291Z 2025-05-07T20:32:41.0477387Z moe/activation_test.py:117: 2025-05-07T20:32:41.0477670Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.0477982Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.0478251Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.0478800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:41.0479348Z return fn(*args, **kwargs) 
2025-05-07T20:32:41.0479992Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.0480675Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.0481202Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.0481872Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.0482535Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.0483052Z kernel = self.compile( 2025-05-07T20:32:41.0483582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.0484220Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.0484604Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.0484918Z 2025-05-07T20:32:41.0485119Z self = 2025-05-07T20:32:41.0486186Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.0487589Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4bbb4c0>} 2025-05-07T20:32:41.0488951Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.0489999Z context = 2025-05-07T20:32:41.0490277Z 2025-05-07T20:32:41.0490446Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.0490966Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.0491420Z module_map=module_map) 2025-05-07T20:32:41.0491773Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.0492117Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.0492362Z E ^ 2025-05-07T20:32:41.0492810Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.0493248Z 2025-05-07T20:32:41.0493656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.0494157Z 2025-05-07T20:32:41.0494263Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.0494666Z self=, 2025-05-07T20:32:41.0495066Z T=16384, 2025-05-07T20:32:41.0495260Z D=5120, 2025-05-07T20:32:41.0495441Z scale_ub=1200.0, 2025-05-07T20:32:41.0495658Z contiguous=False, 2025-05-07T20:32:41.0495881Z compiled=False, 2025-05-07T20:32:41.0496162Z ) 2025-05-07T20:32:41.0496476Z self = 2025-05-07T20:32:41.0496972Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:41.0497244Z 2025-05-07T20:32:41.0497323Z @given( 2025-05-07T20:32:41.0497540Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.0497844Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.0498137Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.0498447Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.0498762Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.0499036Z ) 2025-05-07T20:32:41.0499365Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.0499817Z def test_silu_mul_quant( 2025-05-07T20:32:41.0500050Z self, 2025-05-07T20:32:41.0500232Z T: int, 2025-05-07T20:32:41.0500422Z D: int, 2025-05-07T20:32:41.0500633Z scale_ub: Optional[float], 2025-05-07T20:32:41.0500892Z contiguous: bool, 2025-05-07T20:32:41.0501132Z compiled: bool, 2025-05-07T20:32:41.0501350Z ) -> None: 2025-05-07T20:32:41.0501555Z torch.manual_seed(2025) 2025-05-07T20:32:41.0501778Z 2025-05-07T20:32:41.0502042Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.0502378Z 2025-05-07T20:32:41.0502559Z x_sign = torch.sign(x) 2025-05-07T20:32:41.0502840Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.0503145Z x = x_sign * x_clamp 2025-05-07T20:32:41.0503372Z x0 = x[:, :D] 2025-05-07T20:32:41.0503580Z x1 = x[:, D:] 2025-05-07T20:32:41.0503774Z 2025-05-07T20:32:41.0503944Z if contiguous: 2025-05-07T20:32:41.0504253Z x0 = x0.contiguous() 2025-05-07T20:32:41.0504502Z x1 = x1.contiguous() 2025-05-07T20:32:41.0504724Z 2025-05-07T20:32:41.0504919Z if scale_ub is not None: 2025-05-07T20:32:41.0505179Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.0505505Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.0505808Z ) 2025-05-07T20:32:41.0505987Z else: 2025-05-07T20:32:41.0506190Z scale_ub_tensor = None 2025-05-07T20:32:41.0506432Z 2025-05-07T20:32:41.0506654Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.0506967Z op = silu_mul_quant 2025-05-07T20:32:41.0514426Z if compiled: 2025-05-07T20:32:41.0514689Z op = torch.compile(op) 2025-05-07T20:32:41.0514984Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.0515247Z 2025-05-07T20:32:41.0515441Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.0515622Z 2025-05-07T20:32:41.0515719Z moe/activation_test.py:117: 2025-05-07T20:32:41.0516019Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.0516345Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.0516621Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.0517356Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:41.0518031Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.0518571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.0519237Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.0519903Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.0520422Z kernel = self.compile( 2025-05-07T20:32:41.0520959Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.0521705Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.0522092Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.0522313Z 2025-05-07T20:32:41.0522520Z self = 2025-05-07T20:32:41.0523576Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.0524925Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4f1c860>} 2025-05-07T20:32:41.0526234Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.0527239Z context = 2025-05-07T20:32:41.0527518Z 2025-05-07T20:32:41.0527682Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.0528182Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.0528637Z module_map=module_map) 2025-05-07T20:32:41.0528988Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.0529333Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.0529581Z E ^ 2025-05-07T20:32:41.0530028Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.0530466Z 2025-05-07T20:32:41.0530902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.0531487Z 2025-05-07T20:32:41.0531587Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.0531989Z self=, 2025-05-07T20:32:41.0532378Z T=16384, 2025-05-07T20:32:41.0532568Z D=5120, 2025-05-07T20:32:41.0532751Z scale_ub=1200.0, 2025-05-07T20:32:41.0532967Z contiguous=True, 2025-05-07T20:32:41.0533177Z compiled=True, 2025-05-07T20:32:41.0533369Z ) 2025-05-07T20:32:41.0533682Z self = 2025-05-07T20:32:41.0534164Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:41.0534437Z 2025-05-07T20:32:41.0534508Z @given( 2025-05-07T20:32:41.0534729Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.0535029Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.0535322Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.0535640Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.0535961Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.0536240Z ) 2025-05-07T20:32:41.0536572Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.0537017Z def test_silu_mul_quant( 2025-05-07T20:32:41.0537249Z self, 2025-05-07T20:32:41.0537431Z T: int, 2025-05-07T20:32:41.0537619Z D: int, 2025-05-07T20:32:41.0537827Z scale_ub: Optional[float], 2025-05-07T20:32:41.0538084Z contiguous: bool, 2025-05-07T20:32:41.0538314Z compiled: bool, 2025-05-07T20:32:41.0538530Z ) -> None: 2025-05-07T20:32:41.0538728Z torch.manual_seed(2025) 2025-05-07T20:32:41.0538966Z 2025-05-07T20:32:41.0539226Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.0539562Z 2025-05-07T20:32:41.0539740Z x_sign = torch.sign(x) 2025-05-07T20:32:41.0540021Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.0540634Z x = x_sign * x_clamp 2025-05-07T20:32:41.0541006Z x0 = x[:, :D] 2025-05-07T20:32:41.0541221Z x1 = x[:, D:] 2025-05-07T20:32:41.0541420Z 2025-05-07T20:32:41.0541593Z if contiguous: 2025-05-07T20:32:41.0541811Z x0 = x0.contiguous() 2025-05-07T20:32:41.0542056Z x1 = x1.contiguous() 2025-05-07T20:32:41.0542286Z 2025-05-07T20:32:41.0542468Z if scale_ub is not None: 2025-05-07T20:32:41.0542729Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.0543046Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.0543349Z ) 2025-05-07T20:32:41.0543534Z else: 2025-05-07T20:32:41.0543728Z scale_ub_tensor = None 2025-05-07T20:32:41.0543973Z 2025-05-07T20:32:41.0544195Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.0544499Z op = silu_mul_quant 2025-05-07T20:32:41.0544735Z if compiled: 2025-05-07T20:32:41.0544973Z op = torch.compile(op) 2025-05-07T20:32:41.0545260Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.0545518Z 2025-05-07T20:32:41.0545702Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.0545860Z 2025-05-07T20:32:41.0545960Z moe/activation_test.py:117: 2025-05-07T20:32:41.0546236Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.0546550Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.0546816Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.0547355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:41.0548052Z return fn(*args, **kwargs) 
2025-05-07T20:32:41.0548829Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.0549648Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.0550177Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.0550836Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.0551503Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.0552016Z kernel = self.compile( 2025-05-07T20:32:41.0552562Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.0553205Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.0553591Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.0553810Z 2025-05-07T20:32:41.0554015Z self = 2025-05-07T20:32:41.0555083Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.0556432Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4f1db20>} 2025-05-07T20:32:41.0557799Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.0558802Z context = 2025-05-07T20:32:41.0559078Z 2025-05-07T20:32:41.0559237Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.0559750Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.0560211Z module_map=module_map) 2025-05-07T20:32:41.0560650Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.0560996Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.0561242Z E ^ 2025-05-07T20:32:41.0561693Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.0562132Z 2025-05-07T20:32:41.0562557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.2083364Z 2025-05-07T20:32:41.2083685Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.2084101Z self=, 2025-05-07T20:32:41.2084559Z T=16384, 2025-05-07T20:32:41.2084755Z D=5120, 2025-05-07T20:32:41.2085031Z scale_ub=None, 2025-05-07T20:32:41.2085324Z contiguous=False, 2025-05-07T20:32:41.2085655Z compiled=True, 2025-05-07T20:32:41.2085944Z ) 2025-05-07T20:32:41.2086295Z self = 2025-05-07T20:32:41.2086802Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:41.2087072Z 2025-05-07T20:32:41.2087157Z @given( 2025-05-07T20:32:41.2087379Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.2087691Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.2087993Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.2088334Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.2088647Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.2088929Z ) 2025-05-07T20:32:41.2089272Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.2089701Z def test_silu_mul_quant( 2025-05-07T20:32:41.2089944Z self, 2025-05-07T20:32:41.2090315Z T: int, 2025-05-07T20:32:41.2090511Z D: int, 2025-05-07T20:32:41.2090734Z scale_ub: Optional[float], 2025-05-07T20:32:41.2090999Z contiguous: bool, 2025-05-07T20:32:41.2091240Z compiled: bool, 2025-05-07T20:32:41.2091465Z ) -> None: 2025-05-07T20:32:41.2091683Z torch.manual_seed(2025) 2025-05-07T20:32:41.2091918Z 2025-05-07T20:32:41.2092188Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.2092531Z 2025-05-07T20:32:41.2092721Z x_sign = torch.sign(x) 2025-05-07T20:32:41.2093000Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.2093307Z x = x_sign * x_clamp 2025-05-07T20:32:41.2093553Z x0 = x[:, :D] 2025-05-07T20:32:41.2093757Z x1 = x[:, D:] 2025-05-07T20:32:41.2093961Z 2025-05-07T20:32:41.2094140Z if contiguous: 2025-05-07T20:32:41.2094363Z x0 = x0.contiguous() 2025-05-07T20:32:41.2094621Z x1 = x1.contiguous() 2025-05-07T20:32:41.2094861Z 2025-05-07T20:32:41.2095042Z if scale_ub is not None: 2025-05-07T20:32:41.2095314Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.2095651Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.2095949Z ) 2025-05-07T20:32:41.2096138Z else: 2025-05-07T20:32:41.2096342Z scale_ub_tensor = None 2025-05-07T20:32:41.2096577Z 2025-05-07T20:32:41.2096823Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.2097152Z op = silu_mul_quant 2025-05-07T20:32:41.2097398Z if compiled: 2025-05-07T20:32:41.2097635Z op = torch.compile(op) 2025-05-07T20:32:41.2097923Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.2098190Z 2025-05-07T20:32:41.2098367Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.2098527Z 2025-05-07T20:32:41.2098622Z moe/activation_test.py:117: 2025-05-07T20:32:41.2098910Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.2099227Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.2099501Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.2100184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:41.2100736Z return fn(*args, **kwargs) 
2025-05-07T20:32:41.2101387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.2102060Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.2102592Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.2103252Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.2103922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.2104457Z kernel = self.compile( 2025-05-07T20:32:41.2104988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.2105634Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.2106017Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.2106237Z 2025-05-07T20:32:41.2106441Z self = 2025-05-07T20:32:41.2107581Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.2108952Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4f1e8e0>} 2025-05-07T20:32:41.2110386Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.2111387Z context = 2025-05-07T20:32:41.2111668Z 2025-05-07T20:32:41.2111833Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.2112342Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.2112820Z module_map=module_map) 2025-05-07T20:32:41.2113179Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.2113522Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.2113778Z E ^ 2025-05-07T20:32:41.2114229Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.2114675Z 2025-05-07T20:32:41.2115096Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.2115598Z 2025-05-07T20:32:41.2115701Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.2116110Z self=, 2025-05-07T20:32:41.2116512Z T=2048, 2025-05-07T20:32:41.2116689Z D=5120, 2025-05-07T20:32:41.2116875Z scale_ub=None, 2025-05-07T20:32:41.2117080Z contiguous=False, 2025-05-07T20:32:41.2117291Z compiled=True, 2025-05-07T20:32:41.2117483Z ) 2025-05-07T20:32:41.2117788Z self = 2025-05-07T20:32:41.2118272Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:41.2118532Z 2025-05-07T20:32:41.2118610Z @given( 2025-05-07T20:32:41.2118831Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.2119132Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.2119436Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.2119757Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.2120162Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.2120437Z ) 2025-05-07T20:32:41.2120786Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.2121226Z def test_silu_mul_quant( 2025-05-07T20:32:41.2121466Z self, 2025-05-07T20:32:41.2121647Z T: int, 2025-05-07T20:32:41.2121834Z D: int, 2025-05-07T20:32:41.2122053Z scale_ub: Optional[float], 2025-05-07T20:32:41.2122310Z contiguous: bool, 2025-05-07T20:32:41.2122542Z compiled: bool, 2025-05-07T20:32:41.2122754Z ) -> None: 2025-05-07T20:32:41.2122962Z torch.manual_seed(2025) 2025-05-07T20:32:41.2123202Z 2025-05-07T20:32:41.2123471Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.2123798Z 2025-05-07T20:32:41.2123985Z x_sign = torch.sign(x) 2025-05-07T20:32:41.2124271Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.2124575Z x = x_sign * x_clamp 2025-05-07T20:32:41.2124820Z x0 = x[:, :D] 2025-05-07T20:32:41.2125032Z x1 = x[:, D:] 2025-05-07T20:32:41.2125231Z 2025-05-07T20:32:41.2125415Z if contiguous: 2025-05-07T20:32:41.2125643Z x0 = x0.contiguous() 2025-05-07T20:32:41.2125889Z x1 = x1.contiguous() 2025-05-07T20:32:41.2126120Z 2025-05-07T20:32:41.2126309Z if scale_ub is not None: 2025-05-07T20:32:41.2126578Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.2126950Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.2127248Z ) 2025-05-07T20:32:41.2127433Z else: 2025-05-07T20:32:41.2127638Z scale_ub_tensor = None 2025-05-07T20:32:41.2127882Z 2025-05-07T20:32:41.2128109Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.2128501Z op = silu_mul_quant 2025-05-07T20:32:41.2128744Z if compiled: 2025-05-07T20:32:41.2128991Z op = torch.compile(op) 2025-05-07T20:32:41.2129272Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.2129538Z 2025-05-07T20:32:41.2129724Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.2129883Z 2025-05-07T20:32:41.2129979Z moe/activation_test.py:117: 2025-05-07T20:32:41.2130261Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.2130584Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.2130858Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.2131403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:41.2131956Z return fn(*args, **kwargs) 
2025-05-07T20:32:41.2132616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.2133292Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.2133831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.2134509Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.2135160Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.2135669Z kernel = self.compile( 2025-05-07T20:32:41.2136200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.2136833Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.2137219Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.2137440Z 2025-05-07T20:32:41.2137639Z self = 2025-05-07T20:32:41.2138785Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.2140374Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c48c4040>} 2025-05-07T20:32:41.2141705Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.2142705Z context = 2025-05-07T20:32:41.2142982Z 2025-05-07T20:32:41.2143140Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.2143651Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.2144115Z module_map=module_map) 2025-05-07T20:32:41.2144467Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.2144815Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.2145056Z E ^ 2025-05-07T20:32:41.2145502Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.2145939Z 2025-05-07T20:32:41.2146368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.3728778Z 2025-05-07T20:32:41.3728923Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.3729337Z self=, 2025-05-07T20:32:41.3729827Z T=2048, 2025-05-07T20:32:41.3730085Z D=5120, 2025-05-07T20:32:41.3730344Z scale_ub=1200.0, 2025-05-07T20:32:41.3730849Z contiguous=False, 2025-05-07T20:32:41.3731145Z compiled=True, 2025-05-07T20:32:41.3731421Z ) 2025-05-07T20:32:41.3731853Z self = 2025-05-07T20:32:41.3732457Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:41.3732725Z 2025-05-07T20:32:41.3732800Z @given( 2025-05-07T20:32:41.3733026Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.3733329Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.3733621Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.3733942Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.3734255Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.3734526Z ) 2025-05-07T20:32:41.3734874Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.3735309Z def test_silu_mul_quant( 2025-05-07T20:32:41.3735542Z self, 2025-05-07T20:32:41.3735738Z T: int, 2025-05-07T20:32:41.3735927Z D: int, 2025-05-07T20:32:41.3736140Z scale_ub: Optional[float], 2025-05-07T20:32:41.3736406Z contiguous: bool, 2025-05-07T20:32:41.3736638Z compiled: bool, 2025-05-07T20:32:41.3736863Z ) -> None: 2025-05-07T20:32:41.3737105Z torch.manual_seed(2025) 2025-05-07T20:32:41.3737347Z 2025-05-07T20:32:41.3737609Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.3737936Z 2025-05-07T20:32:41.3738123Z x_sign = torch.sign(x) 2025-05-07T20:32:41.3738427Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.3738723Z x = x_sign * x_clamp 2025-05-07T20:32:41.3738974Z x0 = x[:, :D] 2025-05-07T20:32:41.3739233Z x1 = x[:, D:] 2025-05-07T20:32:41.3739443Z 2025-05-07T20:32:41.3739629Z if contiguous: 2025-05-07T20:32:41.3739856Z x0 = x0.contiguous() 2025-05-07T20:32:41.3740390Z x1 = x1.contiguous() 2025-05-07T20:32:41.3740629Z 2025-05-07T20:32:41.3740807Z if scale_ub is not None: 2025-05-07T20:32:41.3741217Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.3741552Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.3741842Z ) 2025-05-07T20:32:41.3742044Z else: 2025-05-07T20:32:41.3742248Z scale_ub_tensor = None 2025-05-07T20:32:41.3742479Z 2025-05-07T20:32:41.3742705Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.3743008Z op = silu_mul_quant 2025-05-07T20:32:41.3743248Z if compiled: 2025-05-07T20:32:41.3743490Z op = torch.compile(op) 2025-05-07T20:32:41.3743780Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.3744044Z 2025-05-07T20:32:41.3744223Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.3744388Z 2025-05-07T20:32:41.3744483Z moe/activation_test.py:117: 2025-05-07T20:32:41.3744774Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.3745091Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.3745373Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.3745938Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:41.3746484Z return fn(*args, **kwargs) 
2025-05-07T20:32:41.3747137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.3747875Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.3748396Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.3749064Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.3749715Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.3750361Z kernel = self.compile( 2025-05-07T20:32:41.3750904Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.3751537Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.3751937Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.3752157Z 2025-05-07T20:32:41.3752368Z self = 2025-05-07T20:32:41.3753434Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.3754783Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c48c4e00>} 2025-05-07T20:32:41.3756109Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.3757165Z context = 2025-05-07T20:32:41.3757447Z 2025-05-07T20:32:41.3757613Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.3758118Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.3758586Z module_map=module_map) 2025-05-07T20:32:41.3758954Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.3759299Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.3759549Z E ^ 2025-05-07T20:32:41.3760004Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:41.3760449Z 2025-05-07T20:32:41.3760976Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:41.3761480Z 2025-05-07T20:32:41.3761578Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:41.3761981Z self=, 2025-05-07T20:32:41.3762372Z T=4096, 2025-05-07T20:32:41.3762556Z D=5120, 2025-05-07T20:32:41.3762734Z scale_ub=1200.0, 2025-05-07T20:32:41.3762944Z contiguous=True, 2025-05-07T20:32:41.3763171Z compiled=True, 2025-05-07T20:32:41.3763365Z ) 2025-05-07T20:32:41.3763674Z self = 2025-05-07T20:32:41.3764159Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:41.3764421Z 2025-05-07T20:32:41.3764494Z @given( 2025-05-07T20:32:41.3764716Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:41.3765023Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:41.3765318Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:41.3773081Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:41.3773433Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:41.3773731Z ) 2025-05-07T20:32:41.3774078Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:41.3774527Z def test_silu_mul_quant( 2025-05-07T20:32:41.3774766Z self, 2025-05-07T20:32:41.3774950Z T: int, 2025-05-07T20:32:41.3775130Z D: int, 2025-05-07T20:32:41.3775338Z scale_ub: Optional[float], 2025-05-07T20:32:41.3775606Z contiguous: bool, 2025-05-07T20:32:41.3775833Z compiled: bool, 2025-05-07T20:32:41.3776038Z ) -> None: 2025-05-07T20:32:41.3776246Z torch.manual_seed(2025) 2025-05-07T20:32:41.3776479Z 2025-05-07T20:32:41.3776733Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:41.3777178Z 2025-05-07T20:32:41.3777357Z x_sign = torch.sign(x) 2025-05-07T20:32:41.3777641Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:41.3777932Z x = x_sign * x_clamp 2025-05-07T20:32:41.3778153Z x0 = x[:, :D] 2025-05-07T20:32:41.3778355Z x1 = x[:, D:] 2025-05-07T20:32:41.3778543Z 2025-05-07T20:32:41.3778715Z if contiguous: 2025-05-07T20:32:41.3778934Z x0 = x0.contiguous() 2025-05-07T20:32:41.3779169Z x1 = x1.contiguous() 2025-05-07T20:32:41.3779396Z 2025-05-07T20:32:41.3779575Z if scale_ub is not None: 2025-05-07T20:32:41.3779836Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:41.3780159Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:41.3780449Z ) 2025-05-07T20:32:41.3780634Z else: 2025-05-07T20:32:41.3780833Z scale_ub_tensor = None 2025-05-07T20:32:41.3781085Z 2025-05-07T20:32:41.3781301Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:41.3781597Z op = silu_mul_quant 2025-05-07T20:32:41.3781839Z if compiled: 2025-05-07T20:32:41.3782076Z op = torch.compile(op) 2025-05-07T20:32:41.3782354Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.3782617Z 2025-05-07T20:32:41.3782796Z > y_fp8, y_scale = fn() 2025-05-07T20:32:41.3782956Z 2025-05-07T20:32:41.3783050Z moe/activation_test.py:117: 2025-05-07T20:32:41.3783331Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.3783647Z moe/activation_test.py:115: in fn 2025-05-07T20:32:41.3783913Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:41.3784461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:41.3785011Z return fn(*args, **kwargs) 
2025-05-07T20:32:41.3785668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:41.3786427Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:41.3786952Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:41.3787685Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:41.3788327Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:41.3788842Z kernel = self.compile( 2025-05-07T20:32:41.3789376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:41.3790036Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:41.3790422Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:41.3790654Z 2025-05-07T20:32:41.3790854Z self = 2025-05-07T20:32:41.3791911Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:41.3793257Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c48c60c0>} 2025-05-07T20:32:41.3794568Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:41.3795565Z context = 2025-05-07T20:32:41.3795843Z 2025-05-07T20:32:41.3795999Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:41.3796583Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:41.3797034Z module_map=module_map) 2025-05-07T20:32:41.3797388Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:41.3797736Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:41.3797979Z E ^ 2025-05-07T20:32:41.3798419Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:41.3799271Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:41.5469673Z Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=True,
)
self = <...>
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f13c48c72e0>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = <...>

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:32:41.5501220Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
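Every CompilationError in this run is the same architecture mismatch: Triton only lowers the fp8e4nv (FP8 E4M3) dtype on GPUs with compute capability 8.9 or higher, while the g5.4xlarge runner's NVIDIA A10G reports capability 8.6, where Triton offers only fp8e4b15 and fp8e5. Note that compiled=False examples fail the same way, since the eager path still launches the Triton kernel at moe/activation.py:80. A guard along these lines would let such a test skip cleanly instead of erroring; this is a sketch, and `_has_fp8e4nv`, the (8, 9) cutoff, and the class name are assumptions, not part of the FBGEMM test suite:

```python
# Sketch: skip fp8e4nv-dependent tests on GPUs where Triton cannot compile
# that dtype. The (8, 9) cutoff (Ada/Hopper) is an assumption consistent
# with the sm_86 failure above, not an official FBGEMM or Triton API.
import unittest

import torch


def _has_fp8e4nv() -> bool:
    # Best-effort capability probe; returns False on CPU-only hosts.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(_has_fp8e4nv(), "Triton fp8e4nv unsupported on this GPU")
class ActivationFp8Tests(unittest.TestCase):
    ...
```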
Hypothesis went on to try six more examples; each failed with this same CompilationError at the same location (the kernel launch in moe/activation.py:80), with tracebacks identical to the one above. Condensed:
2025-05-07T20:32:41.5501825Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> CompilationError: fp8e4nv not supported
2025-05-07T20:32:41.6681207Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> CompilationError: fp8e4nv not supported
2025-05-07T20:32:41.6712459Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError: fp8e4nv not supported
2025-05-07T20:32:41.6750357Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError: fp8e4nv not supported
2025-05-07T20:32:41.8364853Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError: fp8e4nv not supported
2025-05-07T20:32:41.8395861Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True) -> CompilationError: fp8e4nv not supported
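The failure can be reproduced without the FBGEMM kernel at all. The sketch below is untested and assumes a recent Triton 3.x on a pre-sm_89 CUDA device; as in the tracebacks above, the error should surface at compile time rather than at runtime (the "at 1:0: def ..." location points at the kernel signature, where the fp8e4nv-typed value is rejected during lowering):

```python
# Untested repro sketch: compiling any kernel that produces an fp8e4nv
# value should raise the same CompilationError on a compute-8.6 GPU,
# and succeed on sm_89/sm_90 parts.
import torch
import triton
import triton.language as tl


@triton.jit
def _cast_to_fp8e4nv(x_ptr, y_ptr):
    x = tl.load(x_ptr)                    # scalar fp32 load
    tl.store(y_ptr, x.to(tl.float8e4nv))  # cast rejected on pre-sm_89 GPUs


x = torch.ones(1, device="cuda", dtype=torch.float32)
y = torch.empty(1, device="cuda", dtype=torch.float8_e4m3fn)
_cast_to_fp8e4nv[(1,)](x, y)  # CompilationError on sm_86; fine on sm_89+
```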
2025-05-07T20:32:41.9630978Z Trying example: test_silu_mul_quant(
    self=<...>,
    T=16384,
    D=5120,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False

    [test body as listed above]
        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Four further examples failed the same way while building their inputs, and one more hit the fp8e4nv CompilationError. Condensed:
2025-05-07T20:32:41.9644202Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:95 (tried to allocate 112.00 MiB)
2025-05-07T20:32:41.9657349Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False) -> OutOfMemoryError at moe/activation_test.py:92 (tried to allocate 448.00 MiB)
2025-05-07T20:32:42.0895810Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True) -> OutOfMemoryError at moe/activation_test.py:95 (tried to allocate 56.00 MiB)
2025-05-07T20:32:42.0914984Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False) -> OutOfMemoryError at moe/activation_test.py:94 (tried to allocate 56.00 MiB)
2025-05-07T20:32:42.0927476Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False) -> CompilationError: fp8e4nv not supported
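The OutOfMemoryError sizes are consistent with the test's own tensors: each failing statement materializes one new [T, 2*D] bfloat16 tensor (2 bytes per element), and the reported request sizes match that exactly. A quick check (`bf16_mib` is an illustrative helper, not from the test suite):

```python
# The "Tried to allocate" sizes above are exactly one [T, 2*D] bf16 tensor.
def bf16_mib(T: int, D: int) -> float:
    return T * (2 * D) * 2 / 2**20  # elements * 2 bytes, in MiB


assert bf16_mib(16384, 5120) == 320.0  # torch.abs(x) intermediate, T=16384, D=5120
assert bf16_mib(4096, 7168) == 112.0   # torch.abs(x) intermediate, T=4096,  D=7168
assert bf16_mib(16384, 7168) == 448.0  # torch.randn input,         T=16384, D=7168
assert bf16_mib(2048, 7168) == 56.0    # per-op tensor,             T=2048,  D=7168
```

The requests fail despite being small because 21.9 to 22.0 GiB of the A10G's 22.07 GiB is already in use, i.e. tensors and cached blocks from earlier Hypothesis examples (up to 16384 x 14336) are still held. Beyond the log's own suggestion of PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, one conventional mitigation, shown here only as a sketch (`run_isolated` is hypothetical, not what this suite does), is to release dead tensors and cached blocks between examples:

```python
# Sketch: free cached CUDA blocks between Hypothesis examples so one
# 16384 x 14336 example cannot starve the next one.
import gc

import torch


def run_isolated(test_fn, *args, **kwargs):
    try:
        return test_fn(*args, **kwargs)
    finally:
        gc.collect()              # drop dead Python references first
        torch.cuda.empty_cache()  # return cached blocks to the driver
```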
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.0944620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.0945285Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.0945937Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.0946456Z kernel = self.compile( 2025-05-07T20:32:42.0946988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.0947681Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.0948063Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.0948413Z 2025-05-07T20:32:42.0948612Z self = 2025-05-07T20:32:42.0949695Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.0951038Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4528a40>} 2025-05-07T20:32:42.0952352Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.0953350Z context = 2025-05-07T20:32:42.0953629Z 2025-05-07T20:32:42.0953790Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.0954305Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.0954767Z module_map=module_map) 2025-05-07T20:32:42.0955115Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.0955466Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.0955719Z E ^ 2025-05-07T20:32:42.0956165Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.0956600Z 2025-05-07T20:32:42.0957003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.0957517Z 2025-05-07T20:32:42.0957616Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.0958015Z self=, 2025-05-07T20:32:42.0958417Z T=128, 2025-05-07T20:32:42.0958597Z D=5120, 2025-05-07T20:32:42.0958781Z scale_ub=None, 2025-05-07T20:32:42.0958991Z contiguous=True, 2025-05-07T20:32:42.0959197Z compiled=False, 2025-05-07T20:32:42.0959511Z ) 2025-05-07T20:32:42.0959822Z self = 2025-05-07T20:32:42.0960294Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.0960560Z 2025-05-07T20:32:42.0960636Z @given( 2025-05-07T20:32:42.0960857Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.0961156Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.0961459Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.0961776Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.0962090Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.0962361Z ) 2025-05-07T20:32:42.0962729Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.0963186Z def test_silu_mul_quant( 2025-05-07T20:32:42.0963419Z self, 2025-05-07T20:32:42.0963598Z T: int, 2025-05-07T20:32:42.0963784Z D: int, 2025-05-07T20:32:42.0964002Z scale_ub: Optional[float], 2025-05-07T20:32:42.0964260Z contiguous: bool, 2025-05-07T20:32:42.0964493Z compiled: bool, 2025-05-07T20:32:42.0964708Z ) -> None: 2025-05-07T20:32:42.0964908Z torch.manual_seed(2025) 2025-05-07T20:32:42.0965140Z 2025-05-07T20:32:42.0965406Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.0965736Z 2025-05-07T20:32:42.0965919Z x_sign = torch.sign(x) 2025-05-07T20:32:42.0966202Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.0966495Z x = x_sign * x_clamp 2025-05-07T20:32:42.0966726Z x0 = x[:, :D] 2025-05-07T20:32:42.0966948Z x1 = x[:, D:] 2025-05-07T20:32:42.0967179Z 2025-05-07T20:32:42.0967370Z if contiguous: 2025-05-07T20:32:42.0967709Z x0 = x0.contiguous() 2025-05-07T20:32:42.0967955Z x1 = x1.contiguous() 2025-05-07T20:32:42.0968183Z 2025-05-07T20:32:42.0968368Z if scale_ub is not None: 2025-05-07T20:32:42.0968626Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.0968946Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.0969235Z ) 2025-05-07T20:32:42.0969419Z else: 2025-05-07T20:32:42.0969613Z scale_ub_tensor = None 2025-05-07T20:32:42.0969851Z 2025-05-07T20:32:42.0970074Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.0970369Z op = silu_mul_quant 2025-05-07T20:32:42.0970608Z if compiled: 2025-05-07T20:32:42.0970840Z op = torch.compile(op) 2025-05-07T20:32:42.0971121Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.0971382Z 2025-05-07T20:32:42.0971566Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.0971733Z 2025-05-07T20:32:42.0971827Z moe/activation_test.py:117: 2025-05-07T20:32:42.0972110Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.0972431Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.0972700Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.0973369Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.0974037Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.0974563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.0975221Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.0975869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.0976391Z kernel = self.compile( 2025-05-07T20:32:42.0976952Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.0977668Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.0978056Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.0978278Z 2025-05-07T20:32:42.0978485Z self = 2025-05-07T20:32:42.0979539Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.0980875Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4529940>} 2025-05-07T20:32:42.0982181Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.0983191Z context = 2025-05-07T20:32:42.0983470Z 2025-05-07T20:32:42.0983637Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.0984142Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.0984594Z module_map=module_map) 2025-05-07T20:32:42.0984947Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.0985289Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.0985534Z E ^ 2025-05-07T20:32:42.0985979Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.0986417Z 2025-05-07T20:32:42.0986844Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.2098699Z 2025-05-07T20:32:42.2099165Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.2099846Z self=, 2025-05-07T20:32:42.2100382Z T=128, 2025-05-07T20:32:42.2100634Z D=7168, 2025-05-07T20:32:42.2100880Z scale_ub=None, 2025-05-07T20:32:42.2101144Z contiguous=True, 2025-05-07T20:32:42.2101429Z compiled=False, 2025-05-07T20:32:42.2101699Z ) 2025-05-07T20:32:42.2102077Z self = 2025-05-07T20:32:42.2102562Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.2102840Z 2025-05-07T20:32:42.2102920Z @given( 2025-05-07T20:32:42.2103150Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.2103452Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.2103761Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.2104094Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.2104410Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.2104688Z ) 2025-05-07T20:32:42.2105026Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.2105464Z def test_silu_mul_quant( 2025-05-07T20:32:42.2105728Z self, 2025-05-07T20:32:42.2105911Z T: int, 2025-05-07T20:32:42.2106105Z D: int, 2025-05-07T20:32:42.2106335Z scale_ub: Optional[float], 2025-05-07T20:32:42.2106604Z contiguous: bool, 2025-05-07T20:32:42.2106841Z compiled: bool, 2025-05-07T20:32:42.2107070Z ) -> None: 2025-05-07T20:32:42.2107276Z torch.manual_seed(2025) 2025-05-07T20:32:42.2107589Z 2025-05-07T20:32:42.2107859Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.2108179Z 2025-05-07T20:32:42.2108355Z x_sign = torch.sign(x) 2025-05-07T20:32:42.2108639Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.2108932Z x = x_sign * x_clamp 2025-05-07T20:32:42.2109345Z x0 = x[:, :D] 2025-05-07T20:32:42.2109559Z x1 = x[:, D:] 2025-05-07T20:32:42.2109764Z 2025-05-07T20:32:42.2109938Z if contiguous: 2025-05-07T20:32:42.2110161Z x0 = x0.contiguous() 2025-05-07T20:32:42.2110407Z x1 = x1.contiguous() 2025-05-07T20:32:42.2110632Z 2025-05-07T20:32:42.2110819Z if scale_ub is not None: 2025-05-07T20:32:42.2111089Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.2111414Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.2111712Z ) 2025-05-07T20:32:42.2111897Z else: 2025-05-07T20:32:42.2112094Z scale_ub_tensor = None 2025-05-07T20:32:42.2112333Z 2025-05-07T20:32:42.2112552Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.2112853Z op = silu_mul_quant 2025-05-07T20:32:42.2113090Z if compiled: 2025-05-07T20:32:42.2113328Z op = torch.compile(op) 2025-05-07T20:32:42.2113609Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.2113871Z 2025-05-07T20:32:42.2114056Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.2114216Z 2025-05-07T20:32:42.2114316Z moe/activation_test.py:117: 2025-05-07T20:32:42.2114600Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.2114923Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.2115195Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.2115899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.2116570Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.2117097Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.2117899Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.2118569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.2119085Z kernel = self.compile( 2025-05-07T20:32:42.2119627Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.2120261Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.2120646Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.2120876Z 2025-05-07T20:32:42.2121076Z self = 2025-05-07T20:32:42.2122132Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.2123497Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c452a700>} 2025-05-07T20:32:42.2124807Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.2125852Z context = 2025-05-07T20:32:42.2126129Z 2025-05-07T20:32:42.2126296Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.2126799Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.2127256Z module_map=module_map) 2025-05-07T20:32:42.2127614Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.2127956Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.2128202Z E ^ 2025-05-07T20:32:42.2128735Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.2129173Z 2025-05-07T20:32:42.2129589Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.2130088Z 2025-05-07T20:32:42.2130194Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.2130590Z self=, 2025-05-07T20:32:42.2130982Z T=2048, 2025-05-07T20:32:42.2131157Z D=7168, 2025-05-07T20:32:42.2131333Z scale_ub=1200.0, 2025-05-07T20:32:42.2131546Z contiguous=True, 2025-05-07T20:32:42.2131761Z compiled=False, 2025-05-07T20:32:42.2131949Z ) 2025-05-07T20:32:42.2132255Z self = 2025-05-07T20:32:42.2132744Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.2133003Z 2025-05-07T20:32:42.2133081Z @given( 2025-05-07T20:32:42.2133302Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.2133602Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.2133897Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.2134209Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.2134523Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.2134795Z ) 2025-05-07T20:32:42.2135126Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.2135546Z def test_silu_mul_quant( 2025-05-07T20:32:42.2135776Z self, 2025-05-07T20:32:42.2135951Z T: int, 2025-05-07T20:32:42.2136132Z D: int, 2025-05-07T20:32:42.2136339Z scale_ub: Optional[float], 2025-05-07T20:32:42.2136595Z contiguous: bool, 2025-05-07T20:32:42.2136906Z compiled: bool, 2025-05-07T20:32:42.2137119Z ) -> None: 2025-05-07T20:32:42.2137359Z torch.manual_seed(2025) 2025-05-07T20:32:42.2137596Z 2025-05-07T20:32:42.2137859Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.2139873Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
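Note on the CompilationError repeated above: Triton rejects the fp8e4nv (float8_e4m3fn) encoding because this GPU only exposes fp8e4b15 and fp8e5, which points at a pre-Ada/Hopper part; fp8e4nv is generally assumed to require compute capability 8.9 or newer. A minimal sketch of a skip guard, assuming unittest-style tests and that the (8, 9) threshold matches the installed Triton (both are assumptions, not taken from this log):

    import unittest
    import torch

    def has_fp8e4nv() -> bool:
        # Assumption: fp8e4nv (e4m3) needs SM 8.9+ (Ada/Hopper); older parts
        # such as Ampere only expose fp8e4b15/fp8e5 in Triton.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(has_fp8e4nv(), "Triton fp8e4nv unsupported on this GPU")
    class ActivationTests(unittest.TestCase):
        ...

With such a guard the FP8 cases would be reported as skips on this runner instead of falsifying every Hypothesis example.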
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.2142047Z 2025-05-07T20:32:42.2142159Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.2142361Z 2025-05-07T20:32:42.2142472Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.2142865Z self=, 2025-05-07T20:32:42.2143264Z T=1, 2025-05-07T20:32:42.2143431Z D=5120, 2025-05-07T20:32:42.2143606Z scale_ub=1200.0, 2025-05-07T20:32:42.2143820Z contiguous=True, 2025-05-07T20:32:42.2144038Z compiled=False, 2025-05-07T20:32:42.2144230Z ) 2025-05-07T20:32:42.2144529Z self = 2025-05-07T20:32:42.2144996Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.2145249Z 2025-05-07T20:32:42.2145331Z @given( 2025-05-07T20:32:42.2145541Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.2145837Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.2146132Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.2146445Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.2146762Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.2147037Z ) 2025-05-07T20:32:42.2147581Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.2148008Z def test_silu_mul_quant( 2025-05-07T20:32:42.2148236Z self, 2025-05-07T20:32:42.2148420Z T: int, 2025-05-07T20:32:42.2148599Z D: int, 2025-05-07T20:32:42.2156536Z scale_ub: Optional[float], 2025-05-07T20:32:42.2156825Z contiguous: bool, 2025-05-07T20:32:42.2157068Z compiled: bool, 2025-05-07T20:32:42.2157283Z ) -> None: 2025-05-07T20:32:42.2157493Z torch.manual_seed(2025) 2025-05-07T20:32:42.2157723Z 2025-05-07T20:32:42.2157981Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.2158319Z 2025-05-07T20:32:42.2158509Z x_sign = torch.sign(x) 2025-05-07T20:32:42.2158791Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.2159094Z x = x_sign * x_clamp 2025-05-07T20:32:42.2159323Z x0 = x[:, :D] 2025-05-07T20:32:42.2159526Z x1 = x[:, D:] 2025-05-07T20:32:42.2159728Z 2025-05-07T20:32:42.2159913Z if contiguous: 2025-05-07T20:32:42.2160130Z x0 = x0.contiguous() 2025-05-07T20:32:42.2160375Z x1 = x1.contiguous() 2025-05-07T20:32:42.2160603Z 2025-05-07T20:32:42.2160775Z if scale_ub is not None: 2025-05-07T20:32:42.2161029Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.2161352Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.2161650Z ) 2025-05-07T20:32:42.2161828Z else: 2025-05-07T20:32:42.2162026Z scale_ub_tensor = None 2025-05-07T20:32:42.2162263Z 2025-05-07T20:32:42.2162481Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.2162782Z op = silu_mul_quant 2025-05-07T20:32:42.2163025Z if compiled: 2025-05-07T20:32:42.2163255Z op = torch.compile(op) 2025-05-07T20:32:42.2163703Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.2163965Z 2025-05-07T20:32:42.2164149Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.2164314Z 2025-05-07T20:32:42.2164407Z moe/activation_test.py:117: 2025-05-07T20:32:42.2164694Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.2165012Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.2165281Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.2165959Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.2166633Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.2167157Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.2167821Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.2168479Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.2168993Z kernel = self.compile( 2025-05-07T20:32:42.2169536Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.2170211Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.2170596Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.2170820Z 2025-05-07T20:32:42.2171027Z self = 2025-05-07T20:32:42.2172083Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.2173436Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c452bce0>} 2025-05-07T20:32:42.2174840Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.2175894Z context = 2025-05-07T20:32:42.2176174Z 2025-05-07T20:32:42.2176337Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.2176848Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.2177304Z module_map=module_map) 2025-05-07T20:32:42.2177665Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.2178008Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.2178254Z E ^ 2025-05-07T20:32:42.2178704Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.2179143Z 2025-05-07T20:32:42.2179554Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.2983154Z 2025-05-07T20:32:42.2983416Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.2984033Z self=, 2025-05-07T20:32:42.2984585Z T=2048, 2025-05-07T20:32:42.2984833Z D=5120, 2025-05-07T20:32:42.2985076Z scale_ub=None, 2025-05-07T20:32:42.2985357Z contiguous=True, 2025-05-07T20:32:42.2985653Z compiled=False, 2025-05-07T20:32:42.2985882Z ) 2025-05-07T20:32:42.2986197Z self = 2025-05-07T20:32:42.2986681Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.2986946Z 2025-05-07T20:32:42.2987221Z @given( 2025-05-07T20:32:42.2987528Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.2987831Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.2988147Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.2988472Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.2988786Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.2989064Z ) 2025-05-07T20:32:42.2989397Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.2989842Z def test_silu_mul_quant( 2025-05-07T20:32:42.2990068Z self, 2025-05-07T20:32:42.2990249Z T: int, 2025-05-07T20:32:42.2990450Z D: int, 2025-05-07T20:32:42.2990666Z scale_ub: Optional[float], 2025-05-07T20:32:42.2990928Z contiguous: bool, 2025-05-07T20:32:42.2991154Z compiled: bool, 2025-05-07T20:32:42.2991367Z ) -> None: 2025-05-07T20:32:42.2991568Z torch.manual_seed(2025) 2025-05-07T20:32:42.2991834Z 2025-05-07T20:32:42.2992095Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.2992435Z 2025-05-07T20:32:42.2992626Z > x_sign = torch.sign(x) 2025-05-07T20:32:42.2994534Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.2996360Z 2025-05-07T20:32:42.2996474Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:42.2996680Z 2025-05-07T20:32:42.2996779Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.2997193Z self=, 2025-05-07T20:32:42.2997586Z T=16384, 2025-05-07T20:32:42.2997902Z D=5120, 2025-05-07T20:32:42.2998086Z scale_ub=None, 2025-05-07T20:32:42.2998302Z contiguous=True, 2025-05-07T20:32:42.2998512Z compiled=False, 2025-05-07T20:32:42.2998718Z ) 2025-05-07T20:32:42.2999032Z self = 2025-05-07T20:32:42.2999513Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.2999792Z 2025-05-07T20:32:42.2999866Z @given( 2025-05-07T20:32:42.3000083Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.3000385Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.3000681Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.3001006Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.3001331Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.3001618Z ) 2025-05-07T20:32:42.3001959Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.3002402Z def test_silu_mul_quant( 2025-05-07T20:32:42.3002633Z self, 2025-05-07T20:32:42.3002818Z T: int, 2025-05-07T20:32:42.3003003Z D: int, 2025-05-07T20:32:42.3003208Z scale_ub: Optional[float], 2025-05-07T20:32:42.3003476Z contiguous: bool, 2025-05-07T20:32:42.3003703Z compiled: bool, 2025-05-07T20:32:42.3003908Z ) -> None: 2025-05-07T20:32:42.3004123Z torch.manual_seed(2025) 2025-05-07T20:32:42.3004367Z 2025-05-07T20:32:42.3004624Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.3006639Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
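Note on the OutOfMemoryError runs that follow: only ~30 MiB of the 22.07 GiB device is free while ~21.7 GiB stays allocated by PyTorch, so once the first examples fail, every later Hypothesis example OOMs on its very first allocation. The error message itself suggests the allocator hint; a sketch combining that hint with releasing cached blocks between examples (the setUp placement is an assumption, not something the test currently does):

    import os
    # Exactly the hint printed in the message above; it must be set before the
    # first CUDA allocation, e.g. in the workflow environment.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import unittest
    import torch

    class ActivationTests(unittest.TestCase):  # hypothetical placement
        def setUp(self) -> None:
            torch.cuda.empty_cache()            # return cached blocks to the driver
            torch.cuda.reset_peak_memory_stats()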
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.3008630Z 2025-05-07T20:32:42.3008743Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.3008950Z 2025-05-07T20:32:42.3009049Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.3009461Z self=, 2025-05-07T20:32:42.3009860Z T=4096, 2025-05-07T20:32:42.3010034Z D=5120, 2025-05-07T20:32:42.3010216Z scale_ub=None, 2025-05-07T20:32:42.3010429Z contiguous=True, 2025-05-07T20:32:42.3010654Z compiled=False, 2025-05-07T20:32:42.3010853Z ) 2025-05-07T20:32:42.3011163Z self = 2025-05-07T20:32:42.3011646Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.3011920Z 2025-05-07T20:32:42.3011995Z @given( 2025-05-07T20:32:42.3012226Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.3012528Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.3012833Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.3013152Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.3013457Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.3013733Z ) 2025-05-07T20:32:42.3014072Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.3014512Z def test_silu_mul_quant( 2025-05-07T20:32:42.3014742Z self, 2025-05-07T20:32:42.3014930Z T: int, 2025-05-07T20:32:42.3015116Z D: int, 2025-05-07T20:32:42.3015320Z scale_ub: Optional[float], 2025-05-07T20:32:42.3015576Z contiguous: bool, 2025-05-07T20:32:42.3015819Z compiled: bool, 2025-05-07T20:32:42.3016026Z ) -> None: 2025-05-07T20:32:42.3016231Z torch.manual_seed(2025) 2025-05-07T20:32:42.3016551Z 2025-05-07T20:32:42.3016813Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.3018821Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.3020733Z 2025-05-07T20:32:42.3020844Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.3021057Z 2025-05-07T20:32:42.3021157Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.3021559Z self=, 2025-05-07T20:32:42.3021941Z T=2048, 2025-05-07T20:32:42.3022117Z D=5120, 2025-05-07T20:32:42.3022297Z scale_ub=None, 2025-05-07T20:32:42.3022496Z contiguous=False, 2025-05-07T20:32:42.3022710Z compiled=False, 2025-05-07T20:32:42.3022905Z ) 2025-05-07T20:32:42.3023207Z self = 2025-05-07T20:32:42.3023681Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.3023953Z 2025-05-07T20:32:42.3024030Z @given( 2025-05-07T20:32:42.3024242Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.3024541Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.3024835Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.3025150Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.3025545Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.3025826Z ) 2025-05-07T20:32:42.3026172Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.3026604Z def test_silu_mul_quant( 2025-05-07T20:32:42.3026837Z self, 2025-05-07T20:32:42.3027035Z T: int, 2025-05-07T20:32:42.3027244Z D: int, 2025-05-07T20:32:42.3027535Z scale_ub: Optional[float], 2025-05-07T20:32:42.3027796Z contiguous: bool, 2025-05-07T20:32:42.3028020Z compiled: bool, 2025-05-07T20:32:42.3028233Z ) -> None: 2025-05-07T20:32:42.3028437Z torch.manual_seed(2025) 2025-05-07T20:32:42.3028669Z 2025-05-07T20:32:42.3028924Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.3030929Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.3032843Z 2025-05-07T20:32:42.3032954Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.3033161Z 2025-05-07T20:32:42.3033265Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.3033658Z self=, 2025-05-07T20:32:42.3034058Z T=4096, 2025-05-07T20:32:42.3034239Z D=7168, 2025-05-07T20:32:42.3034416Z scale_ub=None, 2025-05-07T20:32:42.3034617Z contiguous=True, 2025-05-07T20:32:42.3034832Z compiled=True, 2025-05-07T20:32:42.3035026Z ) 2025-05-07T20:32:42.3035332Z self = 2025-05-07T20:32:42.3035817Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:42.3036192Z 2025-05-07T20:32:42.3036275Z @given( 2025-05-07T20:32:42.3036486Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.3036783Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.3037072Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.3037381Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.3037689Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.3037959Z ) 2025-05-07T20:32:42.3038294Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.3038736Z def test_silu_mul_quant( 2025-05-07T20:32:42.3038964Z self, 2025-05-07T20:32:42.3039144Z T: int, 2025-05-07T20:32:42.3039337Z D: int, 2025-05-07T20:32:42.3039556Z scale_ub: Optional[float], 2025-05-07T20:32:42.3039807Z contiguous: bool, 2025-05-07T20:32:42.3040040Z compiled: bool, 2025-05-07T20:32:42.3040611Z ) -> None: 2025-05-07T20:32:42.3040831Z torch.manual_seed(2025) 2025-05-07T20:32:42.3041060Z 2025-05-07T20:32:42.3041315Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.3043317Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.3045234Z 2025-05-07T20:32:42.3045500Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.3045707Z 2025-05-07T20:32:42.3045805Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.3046210Z self=, 2025-05-07T20:32:42.3046607Z T=2048, 2025-05-07T20:32:42.3046794Z D=5120, 2025-05-07T20:32:42.3046975Z scale_ub=1200.0, 2025-05-07T20:32:42.3047189Z contiguous=False, 2025-05-07T20:32:42.3047408Z compiled=False, 2025-05-07T20:32:42.3592192Z ) 2025-05-07T20:32:42.3592584Z self = 2025-05-07T20:32:42.3593256Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.3593584Z 2025-05-07T20:32:42.3593663Z @given( 2025-05-07T20:32:42.3593879Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.3594179Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.3594551Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.3594964Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.3595434Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.3595848Z ) 2025-05-07T20:32:42.3596212Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.3596646Z def test_silu_mul_quant( 2025-05-07T20:32:42.3596885Z self, 2025-05-07T20:32:42.3597078Z T: int, 2025-05-07T20:32:42.3597263Z D: int, 2025-05-07T20:32:42.3597472Z scale_ub: Optional[float], 2025-05-07T20:32:42.3597739Z contiguous: bool, 2025-05-07T20:32:42.3597971Z compiled: bool, 2025-05-07T20:32:42.3598191Z ) -> None: 2025-05-07T20:32:42.3598394Z torch.manual_seed(2025) 2025-05-07T20:32:42.3598630Z 2025-05-07T20:32:42.3598891Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.3601065Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.3602993Z 2025-05-07T20:32:42.3603109Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.3603316Z 2025-05-07T20:32:42.3603422Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.3603818Z self=, 2025-05-07T20:32:42.3604224Z T=4096, 2025-05-07T20:32:42.3604410Z D=7168, 2025-05-07T20:32:42.3604596Z scale_ub=1200.0, 2025-05-07T20:32:42.3604817Z contiguous=True, 2025-05-07T20:32:42.3605042Z compiled=False, 2025-05-07T20:32:42.3605230Z ) 2025-05-07T20:32:42.3605538Z self = 2025-05-07T20:32:42.3606016Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.3606275Z 2025-05-07T20:32:42.3606350Z @given( 2025-05-07T20:32:42.3606565Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.3606863Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.3607160Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.3607473Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.3607800Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.3608083Z ) 2025-05-07T20:32:42.3608410Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.3608845Z def test_silu_mul_quant( 2025-05-07T20:32:42.3609082Z self, 2025-05-07T20:32:42.3609261Z T: int, 2025-05-07T20:32:42.3609568Z D: int, 2025-05-07T20:32:42.3609779Z scale_ub: Optional[float], 2025-05-07T20:32:42.3610036Z contiguous: bool, 2025-05-07T20:32:42.3610267Z compiled: bool, 2025-05-07T20:32:42.3610475Z ) -> None: 2025-05-07T20:32:42.3610687Z torch.manual_seed(2025) 2025-05-07T20:32:42.3610912Z 2025-05-07T20:32:42.3611170Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.3613174Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.3615090Z 2025-05-07T20:32:42.3615212Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.3615416Z 2025-05-07T20:32:42.3615520Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.3615922Z self=, 2025-05-07T20:32:42.3616323Z T=16384, 2025-05-07T20:32:42.3616503Z D=7168, 2025-05-07T20:32:42.3616689Z scale_ub=None, 2025-05-07T20:32:42.3616897Z contiguous=False, 2025-05-07T20:32:42.3617123Z compiled=True, 2025-05-07T20:32:42.3617342Z ) 2025-05-07T20:32:42.3617659Z self = 2025-05-07T20:32:42.3618142Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:42.3618412Z 2025-05-07T20:32:42.3618488Z @given( 2025-05-07T20:32:42.3618707Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.3619013Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.3619310Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.3619633Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.3620040Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.3620323Z ) 2025-05-07T20:32:42.3620566Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.3620654Z def test_silu_mul_quant( 2025-05-07T20:32:42.3620732Z self, 2025-05-07T20:32:42.3620806Z T: int, 2025-05-07T20:32:42.3620876Z D: int, 2025-05-07T20:32:42.3620972Z scale_ub: Optional[float], 2025-05-07T20:32:42.3621058Z contiguous: bool, 2025-05-07T20:32:42.3621138Z compiled: bool, 2025-05-07T20:32:42.3621215Z ) -> None: 2025-05-07T20:32:42.3621305Z torch.manual_seed(2025) 2025-05-07T20:32:42.3621376Z 2025-05-07T20:32:42.3621542Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.3623310Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.3623323Z 2025-05-07T20:32:42.3623435Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.3623439Z 2025-05-07T20:32:42.3623534Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.3623752Z self=, 2025-05-07T20:32:42.3623824Z T=4096, 2025-05-07T20:32:42.3623895Z D=7168, 2025-05-07T20:32:42.3623986Z scale_ub=None, 2025-05-07T20:32:42.3624152Z contiguous=True, 2025-05-07T20:32:42.3624238Z compiled=False, 2025-05-07T20:32:42.3624319Z ) 2025-05-07T20:32:42.3624538Z self = 2025-05-07T20:32:42.3624702Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.3624713Z 2025-05-07T20:32:42.3624786Z @given( 2025-05-07T20:32:42.3624939Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.3625041Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.3625152Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.3625271Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.3625382Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.3625452Z ) 2025-05-07T20:32:42.3625697Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.3625785Z def test_silu_mul_quant( 2025-05-07T20:32:42.3625867Z self, 2025-05-07T20:32:42.3625943Z T: int, 2025-05-07T20:32:42.3626016Z D: int, 2025-05-07T20:32:42.3626113Z scale_ub: Optional[float], 2025-05-07T20:32:42.3626201Z contiguous: bool, 2025-05-07T20:32:42.3626282Z compiled: bool, 2025-05-07T20:32:42.3626362Z ) -> None: 2025-05-07T20:32:42.3626451Z torch.manual_seed(2025) 2025-05-07T20:32:42.3626521Z 2025-05-07T20:32:42.3626692Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.3628573Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.3628585Z 2025-05-07T20:32:42.3628783Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.3628788Z 2025-05-07T20:32:42.3628886Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.3629102Z self=, 2025-05-07T20:32:42.3629181Z T=16384, 2025-05-07T20:32:42.3629257Z D=7168, 2025-05-07T20:32:42.3629337Z scale_ub=None, 2025-05-07T20:32:42.3629420Z contiguous=True, 2025-05-07T20:32:42.3629502Z compiled=False, 2025-05-07T20:32:42.3629575Z ) 2025-05-07T20:32:42.3629787Z self = 2025-05-07T20:32:42.3629956Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:42.3629961Z 2025-05-07T20:32:42.3630040Z @given( 2025-05-07T20:32:42.3630154Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.3630254Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.3630371Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.3630488Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.3630597Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.3630666Z ) 2025-05-07T20:32:42.3630904Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.3631001Z def test_silu_mul_quant( 2025-05-07T20:32:42.3631074Z self, 2025-05-07T20:32:42.3631144Z T: int, 2025-05-07T20:32:42.3631223Z D: int, 2025-05-07T20:32:42.3631317Z scale_ub: Optional[float], 2025-05-07T20:32:42.3631400Z contiguous: bool, 2025-05-07T20:32:42.3631485Z compiled: bool, 2025-05-07T20:32:42.3631562Z ) -> None: 2025-05-07T20:32:42.3631655Z torch.manual_seed(2025) 2025-05-07T20:32:42.3631729Z 2025-05-07T20:32:42.3631998Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.3633747Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.3633753Z 2025-05-07T20:32:42.3633865Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.3633869Z 2025-05-07T20:32:42.3633971Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.3634191Z self=, 2025-05-07T20:32:42.3634266Z T=16384, 2025-05-07T20:32:42.3634350Z D=7168, 2025-05-07T20:32:42.3634429Z scale_ub=1200.0, 2025-05-07T20:32:42.3634506Z contiguous=True, 2025-05-07T20:32:42.3634594Z compiled=False, 2025-05-07T20:32:42.3634664Z ) 2025-05-07T20:32:42.3634880Z self = 2025-05-07T20:32:42.3635054Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:42.3635059Z 2025-05-07T20:32:42.3635129Z @given( 2025-05-07T20:32:42.3635246Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.3635339Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.3635447Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.3635566Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.3635673Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.3635746Z ) 2025-05-07T20:32:42.3635987Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.3636081Z def test_silu_mul_quant( 2025-05-07T20:32:42.3636155Z self, 2025-05-07T20:32:42.3636234Z T: int, 2025-05-07T20:32:42.3636700Z D: int, 2025-05-07T20:32:42.3636797Z scale_ub: Optional[float], 2025-05-07T20:32:42.3636886Z contiguous: bool, 2025-05-07T20:32:42.3636973Z compiled: bool, 2025-05-07T20:32:42.3637070Z ) -> None: 2025-05-07T20:32:42.3637167Z torch.manual_seed(2025) 2025-05-07T20:32:42.3637254Z 2025-05-07T20:32:42.3637418Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.3639176Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.3639186Z 2025-05-07T20:32:42.3639300Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.5447209Z 2025-05-07T20:32:42.5447562Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5448004Z self=, 2025-05-07T20:32:42.5448519Z T=128, 2025-05-07T20:32:42.5448707Z D=5120, 2025-05-07T20:32:42.5448897Z scale_ub=1200.0, 2025-05-07T20:32:42.5449110Z contiguous=False, 2025-05-07T20:32:42.5449339Z compiled=False, 2025-05-07T20:32:42.5449533Z ) 2025-05-07T20:32:42.5449835Z self = 2025-05-07T20:32:42.5450314Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:42.5450582Z 2025-05-07T20:32:42.5450864Z @given( 2025-05-07T20:32:42.5451135Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5459454Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5459807Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5460131Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5460453Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5460731Z ) 2025-05-07T20:32:42.5461076Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5461523Z def test_silu_mul_quant( 2025-05-07T20:32:42.5461760Z self, 2025-05-07T20:32:42.5461942Z T: int, 2025-05-07T20:32:42.5462139Z D: int, 2025-05-07T20:32:42.5462343Z scale_ub: Optional[float], 2025-05-07T20:32:42.5462604Z contiguous: bool, 2025-05-07T20:32:42.5462838Z compiled: bool, 2025-05-07T20:32:42.5463052Z ) -> None: 2025-05-07T20:32:42.5463260Z torch.manual_seed(2025) 2025-05-07T20:32:42.5463499Z 2025-05-07T20:32:42.5463756Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5464089Z 2025-05-07T20:32:42.5464274Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5464552Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5464850Z x = x_sign * x_clamp 2025-05-07T20:32:42.5465079Z x0 = x[:, :D] 2025-05-07T20:32:42.5465289Z x1 = x[:, D:] 2025-05-07T20:32:42.5465481Z 2025-05-07T20:32:42.5465654Z if contiguous: 2025-05-07T20:32:42.5465876Z x0 = x0.contiguous() 2025-05-07T20:32:42.5466118Z x1 = x1.contiguous() 2025-05-07T20:32:42.5466344Z 2025-05-07T20:32:42.5466523Z if scale_ub is not None: 2025-05-07T20:32:42.5466783Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5467109Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5467403Z ) 2025-05-07T20:32:42.5467659Z else: 2025-05-07T20:32:42.5467863Z scale_ub_tensor = None 2025-05-07T20:32:42.5468103Z 2025-05-07T20:32:42.5468483Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5468791Z op = silu_mul_quant 2025-05-07T20:32:42.5469033Z if compiled: 2025-05-07T20:32:42.5469270Z op = torch.compile(op) 2025-05-07T20:32:42.5469557Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5469824Z 2025-05-07T20:32:42.5470004Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.5470167Z 2025-05-07T20:32:42.5470262Z moe/activation_test.py:117: 2025-05-07T20:32:42.5470546Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5470865Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.5471133Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5471810Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.5472492Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.5473021Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5473707Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5474357Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5474873Z kernel = self.compile( 2025-05-07T20:32:42.5475408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5476048Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5476433Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5476653Z 2025-05-07T20:32:42.5476861Z self = 2025-05-07T20:32:42.5478008Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5479358Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4483600>} 2025-05-07T20:32:42.5480666Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5481665Z context = 2025-05-07T20:32:42.5481942Z 2025-05-07T20:32:42.5482107Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5482619Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5483091Z module_map=module_map) 2025-05-07T20:32:42.5483449Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5483792Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.5484041Z E ^ 2025-05-07T20:32:42.5484484Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5484922Z 2025-05-07T20:32:42.5485342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:42.5485843Z 2025-05-07T20:32:42.5485942Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5486343Z self=, 2025-05-07T20:32:42.5486737Z T=2048, 2025-05-07T20:32:42.5486914Z D=7168, 2025-05-07T20:32:42.5487120Z scale_ub=None, 2025-05-07T20:32:42.5487355Z contiguous=False, 2025-05-07T20:32:42.5487578Z compiled=False, 2025-05-07T20:32:42.5487771Z ) 2025-05-07T20:32:42.5488159Z self = 2025-05-07T20:32:42.5488634Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:42.5488898Z 2025-05-07T20:32:42.5488975Z @given( 2025-05-07T20:32:42.5489189Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5489486Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5489774Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5490091Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5490404Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5490676Z ) 2025-05-07T20:32:42.5491016Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5491452Z def test_silu_mul_quant( 2025-05-07T20:32:42.5491679Z self, 2025-05-07T20:32:42.5491863Z T: int, 2025-05-07T20:32:42.5492048Z D: int, 2025-05-07T20:32:42.5492254Z scale_ub: Optional[float], 2025-05-07T20:32:42.5492512Z contiguous: bool, 2025-05-07T20:32:42.5492750Z compiled: bool, 2025-05-07T20:32:42.5492961Z ) -> None: 2025-05-07T20:32:42.5493158Z torch.manual_seed(2025) 2025-05-07T20:32:42.5493390Z 2025-05-07T20:32:42.5493655Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5495666Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:42.5497578Z 2025-05-07T20:32:42.5497695Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:42.5497904Z 2025-05-07T20:32:42.5498008Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:42.5498399Z self=, 2025-05-07T20:32:42.5498794Z T=128, 2025-05-07T20:32:42.5498976Z D=7168, 2025-05-07T20:32:42.5499154Z scale_ub=1200.0, 2025-05-07T20:32:42.5499368Z contiguous=True, 2025-05-07T20:32:42.5499576Z compiled=True, 2025-05-07T20:32:42.5499761Z ) 2025-05-07T20:32:42.5500080Z self = 2025-05-07T20:32:42.5500549Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:42.5500816Z 2025-05-07T20:32:42.5500895Z @given( 2025-05-07T20:32:42.5501110Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:42.5501419Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:42.5501708Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:42.5502026Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:42.5502337Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:42.5502607Z ) 2025-05-07T20:32:42.5502939Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:42.5503384Z def test_silu_mul_quant( 2025-05-07T20:32:42.5503610Z self, 2025-05-07T20:32:42.5503788Z T: int, 2025-05-07T20:32:42.5503974Z D: int, 2025-05-07T20:32:42.5504193Z scale_ub: Optional[float], 2025-05-07T20:32:42.5504451Z contiguous: bool, 2025-05-07T20:32:42.5504685Z compiled: bool, 2025-05-07T20:32:42.5504898Z ) -> None: 2025-05-07T20:32:42.5505098Z torch.manual_seed(2025) 2025-05-07T20:32:42.5505328Z 2025-05-07T20:32:42.5505598Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:42.5505925Z 2025-05-07T20:32:42.5506104Z x_sign = torch.sign(x) 2025-05-07T20:32:42.5506466Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:42.5506761Z x = x_sign * x_clamp 2025-05-07T20:32:42.5506985Z x0 = x[:, :D] 2025-05-07T20:32:42.5507190Z x1 = x[:, D:] 2025-05-07T20:32:42.5507388Z 2025-05-07T20:32:42.5507607Z if contiguous: 2025-05-07T20:32:42.5507828Z x0 = x0.contiguous() 2025-05-07T20:32:42.5508080Z x1 = x1.contiguous() 2025-05-07T20:32:42.5508299Z 2025-05-07T20:32:42.5508481Z if scale_ub is not None: 2025-05-07T20:32:42.5508739Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:42.5509058Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:42.5509354Z ) 2025-05-07T20:32:42.5509548Z else: 2025-05-07T20:32:42.5509742Z scale_ub_tensor = None 2025-05-07T20:32:42.5509981Z 2025-05-07T20:32:42.5510213Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:42.5510504Z op = silu_mul_quant 2025-05-07T20:32:42.5510759Z if compiled: 2025-05-07T20:32:42.5511004Z op = torch.compile(op) 2025-05-07T20:32:42.5511289Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5511546Z 2025-05-07T20:32:42.5511724Z > y_fp8, y_scale = fn() 2025-05-07T20:32:42.5511882Z 2025-05-07T20:32:42.5511981Z moe/activation_test.py:117: 2025-05-07T20:32:42.5512263Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5512582Z moe/activation_test.py:115: in fn 2025-05-07T20:32:42.5512851Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:42.5513403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:42.5513946Z return fn(*args, **kwargs) 
2025-05-07T20:32:42.5514581Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:42.5515343Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:42.5515876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:42.5516539Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:42.5517218Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:42.5517750Z kernel = self.compile( 2025-05-07T20:32:42.5518277Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:42.5518908Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:42.5519298Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:42.5519520Z 2025-05-07T20:32:42.5519729Z self = 2025-05-07T20:32:42.5520829Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:42.5522176Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4260900>} 2025-05-07T20:32:42.5523484Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:42.5524484Z context = 2025-05-07T20:32:42.5524765Z 2025-05-07T20:32:42.5524924Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:42.5525448Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:42.5525986Z module_map=module_map) 2025-05-07T20:32:42.5526341Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:42.5526681Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:42.5526932Z E ^ 2025-05-07T20:32:42.5527379Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:42.5527819Z 2025-05-07T20:32:42.5528230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.1403380Z 2025-05-07T20:32:43.1403981Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.1404598Z self=, 2025-05-07T20:32:43.1405147Z T=128, 2025-05-07T20:32:43.1405356Z D=7168, 2025-05-07T20:32:43.1405560Z scale_ub=1200.0, 2025-05-07T20:32:43.1405815Z contiguous=True, 2025-05-07T20:32:43.1406031Z compiled=False, 2025-05-07T20:32:43.1406226Z ) 2025-05-07T20:32:43.1406538Z self = 2025-05-07T20:32:43.1407010Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.1407280Z 2025-05-07T20:32:43.1407363Z @given( 2025-05-07T20:32:43.1407583Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1407890Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.1408185Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.1408498Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.1408815Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.1409087Z ) 2025-05-07T20:32:43.1409418Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.1410044Z def test_silu_mul_quant( 2025-05-07T20:32:43.1410278Z self, 2025-05-07T20:32:43.1410478Z T: int, 2025-05-07T20:32:43.1410663Z D: int, 2025-05-07T20:32:43.1410883Z scale_ub: Optional[float], 2025-05-07T20:32:43.1411149Z contiguous: bool, 2025-05-07T20:32:43.1411371Z compiled: bool, 2025-05-07T20:32:43.1411592Z ) -> None: 2025-05-07T20:32:43.1411793Z torch.manual_seed(2025) 2025-05-07T20:32:43.1412025Z 2025-05-07T20:32:43.1412298Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1412645Z 2025-05-07T20:32:43.1412826Z x_sign = torch.sign(x) 2025-05-07T20:32:43.1413113Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.1415080Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
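Note: the compiled=True example above fails with the same CompilationError, just with one extra frame in torch/_dynamo/eval_frame.py. torch.compile wraps the op, but the Triton kernel still JIT-compiles at call time, so compilation mode cannot mask the unsupported-dtype error. A sketch (import path taken from the traceback above):

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    compiled_op = torch.compile(silu_mul_quant)
    # Calling compiled_op re-enters silu_mul_quant via eval_frame and launches
    # _fbgemm_silu_mul_quant, which raises the same fp8e4nv ValueError here.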
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.1417009Z 2025-05-07T20:32:43.1417128Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:43.1417335Z 2025-05-07T20:32:43.1417446Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.1417837Z self=, 2025-05-07T20:32:43.1418240Z T=128, 2025-05-07T20:32:43.1418417Z D=5120, 2025-05-07T20:32:43.1418593Z scale_ub=1200.0, 2025-05-07T20:32:43.1418808Z contiguous=True, 2025-05-07T20:32:43.1419018Z compiled=True, 2025-05-07T20:32:43.1419209Z ) 2025-05-07T20:32:43.1419510Z self = 2025-05-07T20:32:43.1419991Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.1420257Z 2025-05-07T20:32:43.1420333Z @given( 2025-05-07T20:32:43.1420668Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1420974Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.1421273Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.1421590Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.1421904Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.1422176Z ) 2025-05-07T20:32:43.1422505Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.1422946Z def test_silu_mul_quant( 2025-05-07T20:32:43.1423175Z self, 2025-05-07T20:32:43.1423354Z T: int, 2025-05-07T20:32:43.1423544Z D: int, 2025-05-07T20:32:43.1423753Z scale_ub: Optional[float], 2025-05-07T20:32:43.1424012Z contiguous: bool, 2025-05-07T20:32:43.1424257Z compiled: bool, 2025-05-07T20:32:43.1424472Z ) -> None: 2025-05-07T20:32:43.1424686Z torch.manual_seed(2025) 2025-05-07T20:32:43.1424916Z 2025-05-07T20:32:43.1425183Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1425519Z 2025-05-07T20:32:43.1425697Z > x_sign = torch.sign(x) 2025-05-07T20:32:43.1427653Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.1429555Z 2025-05-07T20:32:43.1429755Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:43.1429957Z 2025-05-07T20:32:43.1430060Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.1430464Z self=, 2025-05-07T20:32:43.1430867Z T=128, 2025-05-07T20:32:43.1431041Z D=7168, 2025-05-07T20:32:43.1431224Z scale_ub=None, 2025-05-07T20:32:43.1431423Z contiguous=True, 2025-05-07T20:32:43.1431636Z compiled=True, 2025-05-07T20:32:43.1431832Z ) 2025-05-07T20:32:43.1432135Z self = 2025-05-07T20:32:43.1432605Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.1432857Z 2025-05-07T20:32:43.1432939Z @given( 2025-05-07T20:32:43.1433149Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1433449Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.1433740Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.1434053Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.1434366Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.1434647Z ) 2025-05-07T20:32:43.1434982Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.1435403Z def test_silu_mul_quant( 2025-05-07T20:32:43.1435636Z self, 2025-05-07T20:32:43.1435817Z T: int, 2025-05-07T20:32:43.1436002Z D: int, 2025-05-07T20:32:43.1436208Z scale_ub: Optional[float], 2025-05-07T20:32:43.1436472Z contiguous: bool, 2025-05-07T20:32:43.1436696Z compiled: bool, 2025-05-07T20:32:43.1436906Z ) -> None: 2025-05-07T20:32:43.1437114Z torch.manual_seed(2025) 2025-05-07T20:32:43.1437342Z 2025-05-07T20:32:43.1437598Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1439675Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.1441752Z 2025-05-07T20:32:43.1441865Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.1442069Z 2025-05-07T20:32:43.1493520Z FAILED 2025-05-07T20:32:43.1493677Z 2025-05-07T20:32:43.1493862Z =================================== FAILURES =================================== 2025-05-07T20:32:43.1494447Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:32:43.1495057Z + Exception Group Traceback (most recent call last): 2025-05-07T20:32:43.1495777Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:32:43.1496311Z | yield 2025-05-07T20:32:43.1496921Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 651, in run 2025-05-07T20:32:43.1497655Z | self._callTestMethod(testMethod) 2025-05-07T20:32:43.1498037Z | ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:43.1498777Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 606, in _callTestMethod 2025-05-07T20:32:43.1499561Z | if method() is not None: 2025-05-07T20:32:43.1499894Z | ~~~~~~^^ 2025-05-07T20:32:43.1500788Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:32:43.1501757Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1502151Z | ^^^^^^^ 2025-05-07T20:32:43.1503189Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:32:43.1504069Z | raise the_error_hypothesis_found 2025-05-07T20:32:43.1504637Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:32:43.1505206Z +-+---------------- 1 ---------------- 2025-05-07T20:32:43.1505594Z | Traceback (most recent call last): 2025-05-07T20:32:43.1506551Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:32:43.1507769Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1510569Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
FAILED

=================================== FAILURES ===================================
_____________________ ActivationTests.test_silu_mul_quant ______________________
  + Exception Group Traceback (most recent call last):
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 58, in testPartExecutor
  |     yield
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 651, in run
  |     self._callTestMethod(testMethod)
  |     ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 606, in _callTestMethod
  |     if method() is not None:
  |        ~~~~~~^^
  |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant
  |     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
  |       ^^^^^^^
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/hypothesis/core.py", line 1850, in wrapped_test
  |     raise the_error_hypothesis_found
  | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    | Falsifying example: test_silu_mul_quant(
    |     self=,
    |     T=128,
    |     D=7168,
    |     scale_ub=1200.0,
    |     contiguous=True,
    |     compiled=False,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAUEAQQE=') as a decorator on your test case
    +---------------- 2 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    | Falsifying example: test_silu_mul_quant(
    |     self=,
    |     T=128,
    |     D=7168,
    |     scale_ub=None,
    |     contiguous=True,
    |     compiled=True,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case
    +---------------- 3 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    | Falsifying example: test_silu_mul_quant(
    |     self=,
    |     T=128,
    |     D=5120,
    |     scale_ub=1200.0,
    |     contiguous=True,
    |     compiled=True,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case
    +---------------- 4 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant
    |     y_fp8_ref, y_scale_ref = ref_fn()
    |     ~~~~~~^^
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn
    |     return triton_quantize_fp8_row(y, scale_ub_tensor)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row
    |     _kernel_quantize_fp8_row[grid](
    |     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
    |         a,
    |         ^^
    |     ...<23 lines>...
    |         USE_INT64=use_int64,
    |         ^^^^^^^^^^^^^^^^^^^^
    |     )
    |     ^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 330, in <lambda>
    |     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
    |            ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 186, in run
    |     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
    |               ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 166, in _bench
    |     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
    |            ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py", line 117, in do_bench
    |     fn()
    |     ~~^^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call
    |     self.fn.run(
    |     ~~~~~~~~~~~^
    |         *args,
    |         ^^^^^^
    |         **current,
    |         ^^^^^^^^^^
    |     )
    |     ^
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 623, in run
    |     kernel = self.compile(
    |         src,
    |         target=target,
    |         options=options.__dict__,
    |     )
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 273, in compile
    |     module = src.make_ir(options, codegen_fns, module_map, context)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir
    |     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
    |            module_map=module_map)
    | triton.compiler.errors.CompilationError: at 1:0:
    | def _kernel_quantize_fp8_row(
    | ^
    | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
    | Falsifying example: test_silu_mul_quant(
    |     # The test always failed when commented parts were varied together.
    |     self=,
    |     T=1,  # or any other generated value
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=True,  # or any other generated value
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case
    +------------------------------------
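Sub-exceptions 1 through 3 are allocator failures; sub-exception 4 is an architecture gap: Triton rejects fp8e4nv on this runner's GPU and lists only fp8e4b15 and fp8e5 as supported, so the kernel can never compile here. A hedged sketch of a capability guard follows; it assumes fp8e4nv requires compute capability 8.9 or newer (Ada/Hopper), consistent with the g5 runner's GPU being rejected, and it is illustrative rather than FBGEMM's actual skip logic:

    # Sketch, not FBGEMM's actual skip logic: gate fp8e4nv tests on device
    # capability. Assumption: Triton's fp8e4nv needs compute capability >= 8.9
    # (Ada/Hopper); the GPU in this job is below that, matching the error above.
    import unittest

    import torch


    def supports_fp8e4nv() -> bool:
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv needs SM 8.9+ (Ada/Hopper)")
    class Fp8KernelTests(unittest.TestCase):
        def test_device_supports_fp8e4nv(self) -> None:
            self.assertGreaterEqual(torch.cuda.get_device_capability(), (8, 9))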
---------------------------------- Hypothesis ----------------------------------
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = 
T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13ea25d4e0>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
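The error text itself names the fallback: on this architecture Triton accepts only fp8e4b15 and fp8e5. A sketch of picking a Triton-compatible fp8 storage dtype per device; it assumes torch.float8_e4m3fn lowers to Triton's fp8e4nv and torch.float8_e5m2 to fp8e5, and is illustrative rather than the signature of triton_quantize_fp8_row:

    # Sketch (illustrative, not fbgemm's API): choose an fp8 storage dtype the
    # local Triton backend can compile. Assumption: torch.float8_e4m3fn lowers
    # to Triton's fp8e4nv, torch.float8_e5m2 to fp8e5, which the error above
    # lists as supported on this architecture.
    import torch


    def pick_fp8_dtype() -> torch.dtype:
        major, minor = torch.cuda.get_device_capability()
        if (major, minor) >= (8, 9):
            return torch.float8_e4m3fn  # fp8e4nv: higher precision, Ada/Hopper+
        return torch.float8_e5m2  # fp8e5: wider range, accepted on older parts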
2025-05-07T20:32:43.1655709Z op = torch.compile(op) 2025-05-07T20:32:43.1656078Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1656582Z 2025-05-07T20:32:43.1656849Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.1657061Z 2025-05-07T20:32:43.1657182Z moe/activation_test.py:117: 2025-05-07T20:32:43.1657528Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1657923Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.1658285Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1659198Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.1660144Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.1660871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.1661790Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.1662722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.1663404Z kernel = self.compile( 2025-05-07T20:32:43.1664150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.1665000Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.1665469Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1665750Z 2025-05-07T20:32:43.1665995Z self = 2025-05-07T20:32:43.1667572Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.1669581Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13ea28e160>} 2025-05-07T20:32:43.1671379Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.1672689Z context = 2025-05-07T20:32:43.1673038Z 2025-05-07T20:32:43.1673252Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.1673931Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.1674502Z module_map=module_map) 2025-05-07T20:32:43.1674947Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.1675368Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.1675701Z E ^ 2025-05-07T20:32:43.1676262Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.1676830Z 2025-05-07T20:32:43.1677349Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.1678034Z 2025-05-07T20:32:43.1678183Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.1678744Z self=, 2025-05-07T20:32:43.1679297Z T=2048, 2025-05-07T20:32:43.1679546Z D=5120, 2025-05-07T20:32:43.1679803Z scale_ub=1200.0, 2025-05-07T20:32:43.1680098Z contiguous=True, 2025-05-07T20:32:43.1680398Z compiled=True, 2025-05-07T20:32:43.1680672Z ) 2025-05-07T20:32:43.1681100Z self = 2025-05-07T20:32:43.1681762Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.1682132Z 2025-05-07T20:32:43.1682242Z @given( 2025-05-07T20:32:43.1682541Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1683075Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.1683503Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.1683948Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.1684396Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.1684783Z ) 2025-05-07T20:32:43.1685259Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.1685857Z def test_silu_mul_quant( 2025-05-07T20:32:43.1686188Z self, 2025-05-07T20:32:43.1686450Z T: int, 2025-05-07T20:32:43.1686711Z D: int, 2025-05-07T20:32:43.1687005Z scale_ub: Optional[float], 2025-05-07T20:32:43.1687363Z contiguous: bool, 2025-05-07T20:32:43.1687679Z compiled: bool, 2025-05-07T20:32:43.1687982Z ) -> None: 2025-05-07T20:32:43.1688268Z torch.manual_seed(2025) 2025-05-07T20:32:43.1688590Z 2025-05-07T20:32:43.1688966Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1689434Z 2025-05-07T20:32:43.1689686Z x_sign = torch.sign(x) 2025-05-07T20:32:43.1690081Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.1690507Z x = x_sign * x_clamp 2025-05-07T20:32:43.1690827Z x0 = x[:, :D] 2025-05-07T20:32:43.1691119Z x1 = x[:, D:] 2025-05-07T20:32:43.1691402Z 2025-05-07T20:32:43.1691636Z if contiguous: 2025-05-07T20:32:43.1691948Z x0 = x0.contiguous() 2025-05-07T20:32:43.1692276Z x1 = x1.contiguous() 2025-05-07T20:32:43.1692590Z 2025-05-07T20:32:43.1692820Z if scale_ub is not None: 2025-05-07T20:32:43.1693178Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.1693595Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.1694106Z ) 2025-05-07T20:32:43.1694375Z else: 2025-05-07T20:32:43.1694668Z scale_ub_tensor = None 2025-05-07T20:32:43.1695002Z 2025-05-07T20:32:43.1695320Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1695735Z op = silu_mul_quant 2025-05-07T20:32:43.1696073Z if compiled: 2025-05-07T20:32:43.1696425Z op = torch.compile(op) 2025-05-07T20:32:43.1696834Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1697206Z 2025-05-07T20:32:43.1697468Z y_fp8, y_scale = fn() 2025-05-07T20:32:43.1697863Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:43.1698260Z 2025-05-07T20:32:43.1698575Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1699041Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:43.1699448Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:43.1699876Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:43.1700380Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.1700816Z 2025-05-07T20:32:43.1701086Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:43.1701362Z 2025-05-07T20:32:43.1701495Z moe/activation_test.py:126: 2025-05-07T20:32:43.1701908Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1702363Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:43.1702813Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.1703900Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:43.1704927Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:43.1705686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.1706617Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.1707816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:43.1708824Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.1709823Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:43.1710731Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:43.1711580Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:43.1712302Z fn() 2025-05-07T20:32:43.1712982Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:43.1713781Z self.fn.run( 2025-05-07T20:32:43.1714414Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.1715134Z kernel = self.compile( 2025-05-07T20:32:43.1715897Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.1716797Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.1717329Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1717650Z 2025-05-07T20:32:43.1717926Z self = 2025-05-07T20:32:43.1719407Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.1721319Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13ea1be660>} 2025-05-07T20:32:43.1723263Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.1724606Z context = 2025-05-07T20:32:43.1724994Z 2025-05-07T20:32:43.1725223Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.1725862Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.1741654Z module_map=module_map) 2025-05-07T20:32:43.1742166Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.1742629Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:43.1742979Z E ^ 2025-05-07T20:32:43.1743627Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.1744303Z 2025-05-07T20:32:43.1744894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.1745597Z 2025-05-07T20:32:43.1745738Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.1746283Z self=, 2025-05-07T20:32:43.1746819Z T=16384, 2025-05-07T20:32:43.1747082Z D=7168, 2025-05-07T20:32:43.1747332Z scale_ub=1200.0, 2025-05-07T20:32:43.1747746Z contiguous=False, 2025-05-07T20:32:43.1748043Z compiled=False, 2025-05-07T20:32:43.1748308Z ) 2025-05-07T20:32:43.1748710Z self = 2025-05-07T20:32:43.1749365Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:43.1749729Z 2025-05-07T20:32:43.1749839Z @given( 2025-05-07T20:32:43.1750136Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1750533Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.1750925Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.1751689Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.1752145Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.1752523Z ) 2025-05-07T20:32:43.1752947Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.1753511Z def test_silu_mul_quant( 2025-05-07T20:32:43.1753798Z self, 2025-05-07T20:32:43.1754030Z T: int, 2025-05-07T20:32:43.1754257Z D: int, 2025-05-07T20:32:43.1754517Z scale_ub: Optional[float], 2025-05-07T20:32:43.1754839Z contiguous: bool, 2025-05-07T20:32:43.1755115Z compiled: bool, 2025-05-07T20:32:43.1755381Z ) -> None: 2025-05-07T20:32:43.1755645Z torch.manual_seed(2025) 2025-05-07T20:32:43.1755931Z 2025-05-07T20:32:43.1756271Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1756718Z 2025-05-07T20:32:43.1756945Z x_sign = torch.sign(x) 2025-05-07T20:32:43.1757328Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.1757749Z x = x_sign * x_clamp 2025-05-07T20:32:43.1758062Z x0 = x[:, :D] 2025-05-07T20:32:43.1758346Z x1 = x[:, D:] 2025-05-07T20:32:43.1758627Z 2025-05-07T20:32:43.1758859Z if contiguous: 2025-05-07T20:32:43.1759149Z x0 = x0.contiguous() 2025-05-07T20:32:43.1759483Z x1 = x1.contiguous() 2025-05-07T20:32:43.1759781Z 2025-05-07T20:32:43.1760026Z if scale_ub is not None: 2025-05-07T20:32:43.1760381Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.1760808Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.1761191Z ) 2025-05-07T20:32:43.1761437Z else: 2025-05-07T20:32:43.1761705Z scale_ub_tensor = None 2025-05-07T20:32:43.1762186Z 2025-05-07T20:32:43.1762480Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1762881Z op = silu_mul_quant 2025-05-07T20:32:43.1763198Z if compiled: 2025-05-07T20:32:43.1763512Z op = torch.compile(op) 2025-05-07T20:32:43.1763889Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1764235Z 2025-05-07T20:32:43.1764479Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.1764690Z 2025-05-07T20:32:43.1764824Z moe/activation_test.py:117: 2025-05-07T20:32:43.1765192Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1765623Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.1765979Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1766881Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:43.1767778Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.1768507Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.1769435Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.1770333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.1771058Z kernel = self.compile( 2025-05-07T20:32:43.1771803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.1772688Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.1773208Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1773526Z 2025-05-07T20:32:43.1773798Z self = 2025-05-07T20:32:43.1775415Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.1777404Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13ea2a1080>} 2025-05-07T20:32:43.1779214Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.1780599Z context = 2025-05-07T20:32:43.1780985Z 2025-05-07T20:32:43.1781196Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.1781887Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.1782512Z module_map=module_map) 2025-05-07T20:32:43.1782995Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.1783473Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.1783813Z E ^ 2025-05-07T20:32:43.1784428Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.1785043Z 2025-05-07T20:32:43.1785830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.1786546Z 2025-05-07T20:32:43.1786695Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.1787281Z self=, 2025-05-07T20:32:43.1787904Z T=1, 2025-05-07T20:32:43.1788151Z D=7168, 2025-05-07T20:32:43.1788410Z scale_ub=None, 2025-05-07T20:32:43.1788694Z contiguous=True, 2025-05-07T20:32:43.1788997Z compiled=True, 2025-05-07T20:32:43.1789384Z ) 2025-05-07T20:32:43.1789809Z self = 2025-05-07T20:32:43.1790474Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.1790814Z 2025-05-07T20:32:43.1790920Z @given( 2025-05-07T20:32:43.1791210Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1791638Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.1792045Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.1792452Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.1792884Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.1793274Z ) 2025-05-07T20:32:43.1793753Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.1794351Z def test_silu_mul_quant( 2025-05-07T20:32:43.1794683Z self, 2025-05-07T20:32:43.1794945Z T: int, 2025-05-07T20:32:43.1795200Z D: int, 2025-05-07T20:32:43.1795498Z scale_ub: Optional[float], 2025-05-07T20:32:43.1795866Z contiguous: bool, 2025-05-07T20:32:43.1796182Z compiled: bool, 2025-05-07T20:32:43.1796489Z ) -> None: 2025-05-07T20:32:43.1796781Z torch.manual_seed(2025) 2025-05-07T20:32:43.1797114Z 2025-05-07T20:32:43.1797519Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1797985Z 2025-05-07T20:32:43.1798236Z x_sign = torch.sign(x) 2025-05-07T20:32:43.1798641Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.1799068Z x = x_sign * x_clamp 2025-05-07T20:32:43.1799397Z x0 = x[:, :D] 2025-05-07T20:32:43.1799685Z x1 = x[:, D:] 2025-05-07T20:32:43.1799965Z 2025-05-07T20:32:43.1800223Z if contiguous: 2025-05-07T20:32:43.1800529Z x0 = x0.contiguous() 2025-05-07T20:32:43.1800889Z x1 = x1.contiguous() 2025-05-07T20:32:43.1801220Z 2025-05-07T20:32:43.1801473Z if scale_ub is not None: 2025-05-07T20:32:43.1801855Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.1802309Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.1802826Z ) 2025-05-07T20:32:43.1803057Z else: 2025-05-07T20:32:43.1803312Z scale_ub_tensor = None 2025-05-07T20:32:43.1803646Z 2025-05-07T20:32:43.1803961Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1804392Z op = silu_mul_quant 2025-05-07T20:32:43.1804733Z if compiled: 2025-05-07T20:32:43.1805072Z op = torch.compile(op) 2025-05-07T20:32:43.1805475Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1805848Z 2025-05-07T20:32:43.1806103Z y_fp8, y_scale = fn() 2025-05-07T20:32:43.1806488Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:43.1806893Z 2025-05-07T20:32:43.1807207Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1807666Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:43.1808059Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:43.1808482Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:43.1808958Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.1809341Z 2025-05-07T20:32:43.1809588Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:43.1809843Z 2025-05-07T20:32:43.1809965Z moe/activation_test.py:126: 2025-05-07T20:32:43.1810338Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1810777Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:43.1811195Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.1812267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:43.1813292Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:43.1814137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.1815073Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.1816018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:43.1817004Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.1817989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:43.1818822Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:43.1819557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:43.1820188Z fn() 2025-05-07T20:32:43.1820800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:43.1821595Z self.fn.run( 2025-05-07T20:32:43.1822262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.1823001Z kernel = self.compile( 2025-05-07T20:32:43.1823781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.1824703Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.1825248Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1825562Z 2025-05-07T20:32:43.1825834Z self = 2025-05-07T20:32:43.1826980Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.1828617Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13ea2a3740>} 2025-05-07T20:32:43.1829960Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.1830960Z context = 2025-05-07T20:32:43.1831255Z 2025-05-07T20:32:43.1831421Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.1831938Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.1832407Z module_map=module_map) 2025-05-07T20:32:43.1832762Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.1833112Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:43.1833373Z E ^ 2025-05-07T20:32:43.1833830Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.1834276Z 2025-05-07T20:32:43.1834690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.1835201Z 2025-05-07T20:32:43.1835300Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.1835701Z self=, 2025-05-07T20:32:43.1836102Z T=4096, 2025-05-07T20:32:43.1836287Z D=5120, 2025-05-07T20:32:43.1836466Z scale_ub=None, 2025-05-07T20:32:43.1836671Z contiguous=False, 2025-05-07T20:32:43.1836904Z compiled=False, 2025-05-07T20:32:43.1837111Z ) 2025-05-07T20:32:43.1837428Z self = 2025-05-07T20:32:43.1837912Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.1838282Z 2025-05-07T20:32:43.1838356Z @given( 2025-05-07T20:32:43.1838597Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1838898Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.1839200Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.1839530Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.1839843Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.1840477Z ) 2025-05-07T20:32:43.1840836Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.1841448Z def test_silu_mul_quant( 2025-05-07T20:32:43.1841735Z self, 2025-05-07T20:32:43.1842194Z T: int, 2025-05-07T20:32:43.1842475Z D: int, 2025-05-07T20:32:43.1842769Z scale_ub: Optional[float], 2025-05-07T20:32:43.1843184Z contiguous: bool, 2025-05-07T20:32:43.1843482Z compiled: bool, 2025-05-07T20:32:43.1843792Z ) -> None: 2025-05-07T20:32:43.1844141Z torch.manual_seed(2025) 2025-05-07T20:32:43.1845068Z 2025-05-07T20:32:43.1845430Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1845917Z 2025-05-07T20:32:43.1846177Z x_sign = torch.sign(x) 2025-05-07T20:32:43.1846553Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.1847018Z x = x_sign * x_clamp 2025-05-07T20:32:43.1847391Z x0 = x[:, :D] 2025-05-07T20:32:43.1847674Z x1 = x[:, D:] 2025-05-07T20:32:43.1848023Z 2025-05-07T20:32:43.1848306Z if contiguous: 2025-05-07T20:32:43.1848607Z x0 = x0.contiguous() 2025-05-07T20:32:43.1848983Z x1 = x1.contiguous() 2025-05-07T20:32:43.1849376Z 2025-05-07T20:32:43.1849635Z if scale_ub is not None: 2025-05-07T20:32:43.1850025Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.1850457Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.1850832Z ) 2025-05-07T20:32:43.1851147Z else: 2025-05-07T20:32:43.1851459Z scale_ub_tensor = None 2025-05-07T20:32:43.1852009Z 2025-05-07T20:32:43.1852333Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1852763Z op = silu_mul_quant 2025-05-07T20:32:43.1853100Z if compiled: 2025-05-07T20:32:43.1853426Z op = torch.compile(op) 2025-05-07T20:32:43.1853914Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1854268Z 2025-05-07T20:32:43.1854546Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.1854810Z 2025-05-07T20:32:43.1854933Z moe/activation_test.py:117: 2025-05-07T20:32:43.1855313Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1855821Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.1856154Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1856920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.1857817Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.1858400Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.1859168Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.1862194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.1862805Z kernel = self.compile( 2025-05-07T20:32:43.1863399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.1864206Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.1864688Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1865116Z 2025-05-07T20:32:43.1865417Z self = 2025-05-07T20:32:43.1866869Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.1868492Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13e0527100>} 2025-05-07T20:32:43.1869947Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.1871107Z context = 2025-05-07T20:32:43.1871415Z 2025-05-07T20:32:43.1871666Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.1872306Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.1872964Z module_map=module_map) 2025-05-07T20:32:43.1873442Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.1873915Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.1874233Z E ^ 2025-05-07T20:32:43.1874796Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.1875264Z 2025-05-07T20:32:43.1875774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.1876320Z 2025-05-07T20:32:43.1876509Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.1876965Z self=, 2025-05-07T20:32:43.1877538Z T=4096, 2025-05-07T20:32:43.1877849Z D=7168, 2025-05-07T20:32:43.1878089Z scale_ub=None, 2025-05-07T20:32:43.1878500Z contiguous=False, 2025-05-07T20:32:43.1878840Z compiled=False, 2025-05-07T20:32:43.1879182Z ) 2025-05-07T20:32:43.1879619Z self = 2025-05-07T20:32:43.1880226Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.1880518Z 2025-05-07T20:32:43.1880654Z @given( 2025-05-07T20:32:43.1880962Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1881390Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.1881777Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.1882210Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.1882640Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.1883080Z ) 2025-05-07T20:32:43.1883533Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.1884091Z def test_silu_mul_quant( 2025-05-07T20:32:43.1884412Z self, 2025-05-07T20:32:43.1884783Z T: int, 2025-05-07T20:32:43.1885026Z D: int, 2025-05-07T20:32:43.1885352Z scale_ub: Optional[float], 2025-05-07T20:32:43.1885776Z contiguous: bool, 2025-05-07T20:32:43.1886063Z compiled: bool, 2025-05-07T20:32:43.1886388Z ) -> None: 2025-05-07T20:32:43.1886734Z torch.manual_seed(2025) 2025-05-07T20:32:43.1887018Z 2025-05-07T20:32:43.1887470Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1887941Z 2025-05-07T20:32:43.1888180Z x_sign = torch.sign(x) 2025-05-07T20:32:43.1888579Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.1889018Z x = x_sign * x_clamp 2025-05-07T20:32:43.1889356Z x0 = x[:, :D] 2025-05-07T20:32:43.1889618Z x1 = x[:, D:] 2025-05-07T20:32:43.1889959Z 2025-05-07T20:32:43.1890249Z if contiguous: 2025-05-07T20:32:43.1890613Z x0 = x0.contiguous() 2025-05-07T20:32:43.1891001Z x1 = x1.contiguous() 2025-05-07T20:32:43.1891406Z 2025-05-07T20:32:43.1891650Z if scale_ub is not None: 2025-05-07T20:32:43.1892055Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.1892495Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.1892846Z ) 2025-05-07T20:32:43.1893194Z else: 2025-05-07T20:32:43.1893487Z scale_ub_tensor = None 2025-05-07T20:32:43.1893801Z 2025-05-07T20:32:43.1894164Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1894555Z op = silu_mul_quant 2025-05-07T20:32:43.1894873Z if compiled: 2025-05-07T20:32:43.1895259Z op = torch.compile(op) 2025-05-07T20:32:43.1895678Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1895776Z 2025-05-07T20:32:43.1895913Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.1895925Z 2025-05-07T20:32:43.1896125Z moe/activation_test.py:117: 2025-05-07T20:32:43.1896328Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1896461Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.1896585Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1897164Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.1897272Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.1897784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.1898082Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.1898444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.1898601Z kernel = self.compile( 2025-05-07T20:32:43.1899014Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.1899309Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.1899569Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1899574Z 2025-05-07T20:32:43.1899806Z self = 2025-05-07T20:32:43.1900637Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.1901215Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13e0526f20>} 2025-05-07T20:32:43.1901999Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.1902278Z context = 2025-05-07T20:32:43.1902283Z 2025-05-07T20:32:43.1902529Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.1902856Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.1902992Z module_map=module_map) 2025-05-07T20:32:43.1903204Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.1903349Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.1903529Z E ^ 2025-05-07T20:32:43.1903962Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.1903968Z 2025-05-07T20:32:43.1904415Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.1904523Z 2025-05-07T20:32:43.1904653Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.1904949Z self=, 2025-05-07T20:32:43.1905041Z T=128, 2025-05-07T20:32:43.1905248Z D=7168, 2025-05-07T20:32:43.1905358Z scale_ub=None, 2025-05-07T20:32:43.1905495Z contiguous=False, 2025-05-07T20:32:43.1905640Z compiled=True, 2025-05-07T20:32:43.1905738Z ) 2025-05-07T20:32:43.1905976Z self = 2025-05-07T20:32:43.1906277Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:43.1906282Z 2025-05-07T20:32:43.1906414Z @given( 2025-05-07T20:32:43.1906596Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1906721Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.1906863Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.1907118Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.1907272Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.1907400Z ) 2025-05-07T20:32:43.1907760Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.1907883Z def test_silu_mul_quant( 2025-05-07T20:32:43.1907988Z self, 2025-05-07T20:32:43.1908161Z T: int, 2025-05-07T20:32:43.1908304Z D: int, 2025-05-07T20:32:43.1908462Z scale_ub: Optional[float], 2025-05-07T20:32:43.1908577Z contiguous: bool, 2025-05-07T20:32:43.1908689Z compiled: bool, 2025-05-07T20:32:43.1908815Z ) -> None: 2025-05-07T20:32:43.1908983Z torch.manual_seed(2025) 2025-05-07T20:32:43.1909123Z 2025-05-07T20:32:43.1909352Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1909453Z 2025-05-07T20:32:43.1909573Z x_sign = torch.sign(x) 2025-05-07T20:32:43.1909750Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.1909913Z x = x_sign * x_clamp 2025-05-07T20:32:43.1910175Z x0 = x[:, :D] 2025-05-07T20:32:43.1910287Z x1 = x[:, D:] 2025-05-07T20:32:43.1910387Z 2025-05-07T20:32:43.1910565Z if contiguous: 2025-05-07T20:32:43.1910695Z x0 = x0.contiguous() 2025-05-07T20:32:43.1910863Z x1 = x1.contiguous() 2025-05-07T20:32:43.1911010Z 2025-05-07T20:32:43.1911127Z if scale_ub is not None: 2025-05-07T20:32:43.1911258Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.1911593Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.1911680Z ) 2025-05-07T20:32:43.1911832Z else: 2025-05-07T20:32:43.1911997Z scale_ub_tensor = None 2025-05-07T20:32:43.1912092Z 2025-05-07T20:32:43.1912282Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1912427Z op = silu_mul_quant 2025-05-07T20:32:43.1912528Z if compiled: 2025-05-07T20:32:43.1912752Z op = torch.compile(op) 2025-05-07T20:32:43.1912891Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1912988Z 2025-05-07T20:32:43.1913165Z y_fp8, y_scale = fn() 2025-05-07T20:32:43.1913312Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:43.1913393Z 2025-05-07T20:32:43.1913653Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1913780Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:43.1913944Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:43.1914190Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:43.1914355Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.1914534Z 2025-05-07T20:32:43.1914676Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:43.1914681Z 2025-05-07T20:32:43.1914804Z moe/activation_test.py:126: 2025-05-07T20:32:43.1915099Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1915235Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:43.1915417Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.1916074Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:43.1916215Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:43.1916661Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.1916908Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.1917357Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:43.1917623Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.1918099Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:43.1918392Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:43.1918761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:43.1918869Z fn() 2025-05-07T20:32:43.1919345Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:43.1919442Z self.fn.run( 2025-05-07T20:32:43.1919918Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.1920075Z kernel = self.compile( 2025-05-07T20:32:43.1920559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.1920792Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.1920953Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1920958Z 2025-05-07T20:32:43.1921340Z self = 2025-05-07T20:32:43.1922173Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.1922742Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13e0527e20>} 2025-05-07T20:32:43.1923537Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.1923759Z context = 2025-05-07T20:32:43.1923764Z 2025-05-07T20:32:43.1923982Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.1924336Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.1924517Z module_map=module_map) 2025-05-07T20:32:43.1924705Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.1924831Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:43.1924966Z E ^ 2025-05-07T20:32:43.1925312Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:32:43.1943337Z Trying example: test_silu_mul_quant(
    self=<...>,
    T=128,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=False,
)
self = <...>
T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False

    [test source identical to the listing above]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
    [Triton frames: jit.py:330 in <lambda> -> jit.py:623 in run -> compiler.py:273 in compile -> make_ir, with CUDAOptions(num_stages=3, ...)]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:32:43.1956027Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
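Both entry points compile a Triton kernel that writes float8_e4m3fn output, so eager and torch.compile runs fail identically and Hypothesis just replays the same compile error for every example. Rather than failing, the test could skip on hardware without E4M3 support. A sketch of such a guard, reusing the device_supports_fp8e4nv() probe from above (the class name and placement are illustrative, not the test file's actual code):

    import unittest

    import torch

    def device_supports_fp8e4nv() -> bool:
        # Probe from the sketch above: Triton's fp8e4nv needs SM >= 8.9.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    class Fp8GuardExample(unittest.TestCase):
        # On the real test_silu_mul_quant the same decorator would sit
        # outermost, above the @given/@settings stack.
        @unittest.skipUnless(
            device_supports_fp8e4nv(),
            "Triton fp8e4nv (E4M3) unsupported; only fp8e4b15/fp8e5 on this GPU",
        )
        def test_requires_fp8e4nv(self) -> None:
            self.assertTrue(device_supports_fp8e4nv())

On this runner the guard would report one skip reason instead of a wall of repeated CompilationError tracebacks.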
2025-05-07T20:32:43.1956136Z Trying example: test_silu_mul_quant(
    self=<...>,
    T=4096,
    D=5120,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
self = <...>
T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False

    [test source identical to the listing above]

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
    [Triton frames as above: jit.py:330 in <lambda> -> jit.py:623 in run -> compiler.py:273 in compile -> make_ir, with CUDAOptions(num_stages=3, ...)]
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

2025-05-07T20:32:43.1968391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
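For context on what the reference path computes: triton_quantize_fp8_row returns a rowwise fp8 tensor plus a per-row dequantization scale, which is why the test reconstructs values as y_fp8.to(torch.float32) * y_scale[:, None]. A rough eager-mode stand-in (naming ours; behavior inferred from the test's usage, with 448.0 = torch.finfo(torch.float8_e4m3fn).max, and scale_ub assumed to cap the row max before the scale is derived):

    import torch

    FP8_E4M3_MAX = 448.0  # torch.finfo(torch.float8_e4m3fn).max

    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: torch.Tensor | None = None
    ) -> tuple[torch.Tensor, torch.Tensor]:
        row_max = y.abs().amax(dim=-1).to(torch.float32)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.to(row_max.dtype))
        row_max = row_max.clamp(min=1e-12)      # guard against all-zero rows
        gain = FP8_E4M3_MAX / row_max           # per-row quantization gain
        y_fp8 = (y.float() * gain[:, None]).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX)
        return y_fp8.to(torch.float8_e4m3fn), 1.0 / gain

On SM 8.9+ the real Triton kernel fuses this into one pass; on this A10G the kernel cannot even be compiled because fp8e4nv is unavailable.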
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.1967985Z 2025-05-07T20:32:43.1968391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.1968396Z 2025-05-07T20:32:43.1968491Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.1968713Z self=, 2025-05-07T20:32:43.1968786Z T=1, 2025-05-07T20:32:43.1968860Z D=5120, 2025-05-07T20:32:43.1968946Z scale_ub=None, 2025-05-07T20:32:43.1969029Z contiguous=True, 2025-05-07T20:32:43.1969109Z compiled=True, 2025-05-07T20:32:43.1969184Z ) 2025-05-07T20:32:43.1969395Z self = 2025-05-07T20:32:43.1969559Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.1969563Z 2025-05-07T20:32:43.1969637Z @given( 2025-05-07T20:32:43.1969752Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1969849Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.1969960Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.1970069Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.1970180Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.1970252Z ) 2025-05-07T20:32:43.1970491Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.1970580Z def test_silu_mul_quant( 2025-05-07T20:32:43.1970652Z self, 2025-05-07T20:32:43.1970735Z T: int, 2025-05-07T20:32:43.1970813Z D: int, 2025-05-07T20:32:43.1970905Z scale_ub: Optional[float], 2025-05-07T20:32:43.1970993Z contiguous: bool, 2025-05-07T20:32:43.1971152Z compiled: bool, 2025-05-07T20:32:43.1971227Z ) -> None: 2025-05-07T20:32:43.1971321Z torch.manual_seed(2025) 2025-05-07T20:32:43.1971390Z 2025-05-07T20:32:43.1971550Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1971624Z 2025-05-07T20:32:43.1971711Z x_sign = torch.sign(x) 2025-05-07T20:32:43.1971834Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.1971920Z x = x_sign * x_clamp 2025-05-07T20:32:43.1971997Z x0 = x[:, :D] 2025-05-07T20:32:43.1972076Z x1 = x[:, D:] 2025-05-07T20:32:43.1972144Z 2025-05-07T20:32:43.1972223Z if contiguous: 2025-05-07T20:32:43.1972311Z x0 = x0.contiguous() 2025-05-07T20:32:43.1972394Z x1 = x1.contiguous() 2025-05-07T20:32:43.1972463Z 2025-05-07T20:32:43.1972558Z if scale_ub is not None: 2025-05-07T20:32:43.1972660Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.1972796Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.1972875Z ) 2025-05-07T20:32:43.1972948Z else: 2025-05-07T20:32:43.1973039Z scale_ub_tensor = None 2025-05-07T20:32:43.1973113Z 2025-05-07T20:32:43.1973237Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1973326Z op = silu_mul_quant 2025-05-07T20:32:43.1973406Z if compiled: 2025-05-07T20:32:43.1973499Z op = torch.compile(op) 2025-05-07T20:32:43.1973603Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1973671Z 2025-05-07T20:32:43.1973756Z y_fp8, y_scale = fn() 2025-05-07T20:32:43.1973880Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:43.1973947Z 2025-05-07T20:32:43.1974076Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1974284Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:43.1974378Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:43.1974499Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:43.1974638Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.1974707Z 2025-05-07T20:32:43.1974805Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:43.1974809Z 2025-05-07T20:32:43.1974903Z moe/activation_test.py:126: 2025-05-07T20:32:43.1975023Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1975126Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:43.1975253Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.1975802Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:43.1975908Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:43.1976262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.1976486Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.1976848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:43.1977101Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.1977521Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:43.1977686Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:43.1978028Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:43.1978102Z fn() 2025-05-07T20:32:43.1978498Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:43.1978585Z self.fn.run( 2025-05-07T20:32:43.1978995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.1979085Z kernel = self.compile( 2025-05-07T20:32:43.1979481Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.1979648Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.1979773Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1979777Z 2025-05-07T20:32:43.1979972Z self = 2025-05-07T20:32:43.1980734Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.1981240Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c739ac00>} 2025-05-07T20:32:43.1981968Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.1982156Z context = 2025-05-07T20:32:43.1982161Z 2025-05-07T20:32:43.1982317Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.1982575Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.1982680Z module_map=module_map) 2025-05-07T20:32:43.1982837Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.1983018Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:43.1983088Z E ^ 2025-05-07T20:32:43.1983437Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.1983442Z 2025-05-07T20:32:43.1983856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.1983860Z 2025-05-07T20:32:43.1983957Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.1984173Z self=, 2025-05-07T20:32:43.1984245Z T=2048, 2025-05-07T20:32:43.1984319Z D=5120, 2025-05-07T20:32:43.1984399Z scale_ub=None, 2025-05-07T20:32:43.1984479Z contiguous=True, 2025-05-07T20:32:43.1984554Z compiled=True, 2025-05-07T20:32:43.1984627Z ) 2025-05-07T20:32:43.1984837Z self = 2025-05-07T20:32:43.1985007Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.1985012Z 2025-05-07T20:32:43.1985090Z @given( 2025-05-07T20:32:43.1985215Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.1985317Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.1985426Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.1985536Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.1985652Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.1985727Z ) 2025-05-07T20:32:43.1985963Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.1986054Z def test_silu_mul_quant( 2025-05-07T20:32:43.1986128Z self, 2025-05-07T20:32:43.1986199Z T: int, 2025-05-07T20:32:43.1986276Z D: int, 2025-05-07T20:32:43.1986367Z scale_ub: Optional[float], 2025-05-07T20:32:43.1986459Z contiguous: bool, 2025-05-07T20:32:43.1986543Z compiled: bool, 2025-05-07T20:32:43.1986613Z ) -> None: 2025-05-07T20:32:43.1986703Z torch.manual_seed(2025) 2025-05-07T20:32:43.1986772Z 2025-05-07T20:32:43.1987014Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.1987091Z 2025-05-07T20:32:43.1987178Z x_sign = torch.sign(x) 2025-05-07T20:32:43.1987298Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.1987454Z x = x_sign * x_clamp 2025-05-07T20:32:43.1987538Z x0 = x[:, :D] 2025-05-07T20:32:43.1987632Z x1 = x[:, D:] 2025-05-07T20:32:43.1987702Z 2025-05-07T20:32:43.1987780Z if contiguous: 2025-05-07T20:32:43.1987866Z x0 = x0.contiguous() 2025-05-07T20:32:43.1987954Z x1 = x1.contiguous() 2025-05-07T20:32:43.1988022Z 2025-05-07T20:32:43.1988109Z if scale_ub is not None: 2025-05-07T20:32:43.1988210Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.1988343Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.1988415Z ) 2025-05-07T20:32:43.1988487Z else: 2025-05-07T20:32:43.1988580Z scale_ub_tensor = None 2025-05-07T20:32:43.1988649Z 2025-05-07T20:32:43.1988772Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1988855Z op = silu_mul_quant 2025-05-07T20:32:43.1988938Z if compiled: 2025-05-07T20:32:43.1989032Z op = torch.compile(op) 2025-05-07T20:32:43.1989131Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.1989203Z 2025-05-07T20:32:43.1989290Z y_fp8, y_scale = fn() 2025-05-07T20:32:43.1989411Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:43.1989480Z 2025-05-07T20:32:43.1989609Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.1989710Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:43.1989803Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:43.1990002Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:43.1990147Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.1990217Z 2025-05-07T20:32:43.1990313Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:43.1990318Z 2025-05-07T20:32:43.1990414Z moe/activation_test.py:126: 2025-05-07T20:32:43.1990535Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1990639Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:43.1990768Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.1991317Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:43.1991420Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:43.1991774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.1991998Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.1992372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:43.1992622Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.1992993Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:43.1993153Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:43.1993489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:43.1993567Z fn() 2025-05-07T20:32:43.1993961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:43.1994042Z self.fn.run( 2025-05-07T20:32:43.1994379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.1994468Z kernel = self.compile( 2025-05-07T20:32:43.1994927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.1995094Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.1995218Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.1995223Z 2025-05-07T20:32:43.1995425Z self = 2025-05-07T20:32:43.1996186Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.1996681Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c77ca020>} 2025-05-07T20:32:43.1997424Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.1997611Z context = 2025-05-07T20:32:43.1997615Z 2025-05-07T20:32:43.1997772Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.1998028Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.1998135Z module_map=module_map) 2025-05-07T20:32:43.1998291Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.1998390Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:43.1998467Z E ^ 2025-05-07T20:32:43.1998813Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.1998897Z 2025-05-07T20:32:43.1999337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.1999341Z 2025-05-07T20:32:43.1999438Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.1999652Z self=, 2025-05-07T20:32:43.1999730Z T=128, 2025-05-07T20:32:43.1999808Z D=5120, 2025-05-07T20:32:43.1999888Z scale_ub=None, 2025-05-07T20:32:43.1999974Z contiguous=True, 2025-05-07T20:32:43.2000049Z compiled=True, 2025-05-07T20:32:43.2000121Z ) 2025-05-07T20:32:43.2000333Z self = 2025-05-07T20:32:43.2000495Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.2000500Z 2025-05-07T20:32:43.2000579Z @given( 2025-05-07T20:32:43.2000697Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2000791Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2000909Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2001020Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2001130Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2001204Z ) 2025-05-07T20:32:43.2001445Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2001542Z def test_silu_mul_quant( 2025-05-07T20:32:43.2001619Z self, 2025-05-07T20:32:43.2001696Z T: int, 2025-05-07T20:32:43.2001781Z D: int, 2025-05-07T20:32:43.2001876Z scale_ub: Optional[float], 2025-05-07T20:32:43.2001961Z contiguous: bool, 2025-05-07T20:32:43.2002050Z compiled: bool, 2025-05-07T20:32:43.2002128Z ) -> None: 2025-05-07T20:32:43.2002221Z torch.manual_seed(2025) 2025-05-07T20:32:43.2002303Z 2025-05-07T20:32:43.2002469Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2002538Z 2025-05-07T20:32:43.2002717Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2002838Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2002930Z x = x_sign * x_clamp 2025-05-07T20:32:43.2003008Z x0 = x[:, :D] 2025-05-07T20:32:43.2003084Z x1 = x[:, D:] 2025-05-07T20:32:43.2003162Z 2025-05-07T20:32:43.2003244Z if contiguous: 2025-05-07T20:32:43.2003334Z x0 = x0.contiguous() 2025-05-07T20:32:43.2003428Z x1 = x1.contiguous() 2025-05-07T20:32:43.2003500Z 2025-05-07T20:32:43.2003588Z if scale_ub is not None: 2025-05-07T20:32:43.2003698Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2003833Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2003906Z ) 2025-05-07T20:32:43.2003985Z else: 2025-05-07T20:32:43.2004079Z scale_ub_tensor = None 2025-05-07T20:32:43.2004161Z 2025-05-07T20:32:43.2004292Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2004388Z op = silu_mul_quant 2025-05-07T20:32:43.2004475Z if compiled: 2025-05-07T20:32:43.2004575Z op = torch.compile(op) 2025-05-07T20:32:43.2004679Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2004752Z 2025-05-07T20:32:43.2004840Z y_fp8, y_scale = fn() 2025-05-07T20:32:43.2004959Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:43.2005033Z 2025-05-07T20:32:43.2005164Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2005262Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:43.2005364Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:43.2005481Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:43.2005621Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.2005799Z 2025-05-07T20:32:43.2005896Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:43.2005900Z 2025-05-07T20:32:43.2006006Z moe/activation_test.py:126: 2025-05-07T20:32:43.2006131Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2006233Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:43.2006368Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.2006919Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:43.2007022Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:43.2007403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2007645Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2008019Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:43.2008275Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.2008653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:43.2008814Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:43.2009152Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:43.2009234Z fn() 2025-05-07T20:32:43.2009629Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:43.2009709Z self.fn.run( 2025-05-07T20:32:43.2010050Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2010144Z kernel = self.compile( 2025-05-07T20:32:43.2010536Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2010787Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2010911Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2010916Z 2025-05-07T20:32:43.2011121Z self = 2025-05-07T20:32:43.2011882Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2012378Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c679b420>} 2025-05-07T20:32:43.2013112Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2013301Z context = 2025-05-07T20:32:43.2013306Z 2025-05-07T20:32:43.2013470Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2013726Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2013838Z module_map=module_map) 2025-05-07T20:32:43.2013995Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2014092Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:43.2014172Z E ^ 2025-05-07T20:32:43.2014519Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2014524Z 2025-05-07T20:32:43.2015021Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2015026Z 2025-05-07T20:32:43.2015130Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2015350Z self=, 2025-05-07T20:32:43.2015429Z T=4096, 2025-05-07T20:32:43.2015502Z D=5120, 2025-05-07T20:32:43.2015580Z scale_ub=None, 2025-05-07T20:32:43.2015675Z contiguous=True, 2025-05-07T20:32:43.2015759Z compiled=True, 2025-05-07T20:32:43.2015831Z ) 2025-05-07T20:32:43.2016053Z self = 2025-05-07T20:32:43.2016220Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.2016225Z 2025-05-07T20:32:43.2016306Z @given( 2025-05-07T20:32:43.2016421Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2016517Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2016644Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2016757Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2016871Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2016950Z ) 2025-05-07T20:32:43.2017189Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2017280Z def test_silu_mul_quant( 2025-05-07T20:32:43.2017366Z self, 2025-05-07T20:32:43.2017458Z T: int, 2025-05-07T20:32:43.2017543Z D: int, 2025-05-07T20:32:43.2017660Z scale_ub: Optional[float], 2025-05-07T20:32:43.2017749Z contiguous: bool, 2025-05-07T20:32:43.2017837Z compiled: bool, 2025-05-07T20:32:43.2017913Z ) -> None: 2025-05-07T20:32:43.2018004Z torch.manual_seed(2025) 2025-05-07T20:32:43.2018085Z 2025-05-07T20:32:43.2018248Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2018322Z 2025-05-07T20:32:43.2018423Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2018543Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2018714Z x = x_sign * x_clamp 2025-05-07T20:32:43.2018802Z x0 = x[:, :D] 2025-05-07T20:32:43.2018878Z x1 = x[:, D:] 2025-05-07T20:32:43.2018949Z 2025-05-07T20:32:43.2019038Z if contiguous: 2025-05-07T20:32:43.2019127Z x0 = x0.contiguous() 2025-05-07T20:32:43.2019218Z x1 = x1.contiguous() 2025-05-07T20:32:43.2019288Z 2025-05-07T20:32:43.2019376Z if scale_ub is not None: 2025-05-07T20:32:43.2019486Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2019617Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2019692Z ) 2025-05-07T20:32:43.2019770Z else: 2025-05-07T20:32:43.2019864Z scale_ub_tensor = None 2025-05-07T20:32:43.2019939Z 2025-05-07T20:32:43.2020074Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2020166Z op = silu_mul_quant 2025-05-07T20:32:43.2020249Z if compiled: 2025-05-07T20:32:43.2020359Z op = torch.compile(op) 2025-05-07T20:32:43.2020462Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2020538Z 2025-05-07T20:32:43.2020629Z y_fp8, y_scale = fn() 2025-05-07T20:32:43.2020746Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:43.2020822Z 2025-05-07T20:32:43.2020954Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2021052Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:43.2021159Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:43.2021275Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:43.2021409Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.2021489Z 2025-05-07T20:32:43.2021585Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:43.2021671Z 2025-05-07T20:32:43.2021772Z moe/activation_test.py:126: 2025-05-07T20:32:43.2021896Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2022007Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:43.2022142Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.2022692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:43.2022789Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:43.2023148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2023365Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2023735Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:43.2023985Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.2024367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:43.2024534Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:43.2024872Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:43.2024951Z fn() 2025-05-07T20:32:43.2025348Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:43.2025430Z self.fn.run( 2025-05-07T20:32:43.2025769Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2025861Z kernel = self.compile( 2025-05-07T20:32:43.2026237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2026417Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2026618Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2026623Z 2025-05-07T20:32:43.2026827Z self = 2025-05-07T20:32:43.2027682Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2028174Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c67eaac0>} 2025-05-07T20:32:43.2028911Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2029103Z context = 2025-05-07T20:32:43.2029107Z 2025-05-07T20:32:43.2029276Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2029533Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2029638Z module_map=module_map) 2025-05-07T20:32:43.2029800Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2029899Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:43.2029981Z E ^ 2025-05-07T20:32:43.2030330Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2030334Z 2025-05-07T20:32:43.2030765Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2030849Z 2025-05-07T20:32:43.2030957Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2031173Z self=, 2025-05-07T20:32:43.2031262Z T=16384, 2025-05-07T20:32:43.2031342Z D=5120, 2025-05-07T20:32:43.2031421Z scale_ub=None, 2025-05-07T20:32:43.2031508Z contiguous=True, 2025-05-07T20:32:43.2031589Z compiled=True, 2025-05-07T20:32:43.2031660Z ) 2025-05-07T20:32:43.2031877Z self = 2025-05-07T20:32:43.2032049Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.2032053Z 2025-05-07T20:32:43.2032127Z @given( 2025-05-07T20:32:43.2032249Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2032346Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2032459Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2032579Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2032693Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2032773Z ) 2025-05-07T20:32:43.2033017Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2033107Z def test_silu_mul_quant( 2025-05-07T20:32:43.2033189Z self, 2025-05-07T20:32:43.2033266Z T: int, 2025-05-07T20:32:43.2033343Z D: int, 2025-05-07T20:32:43.2033442Z scale_ub: Optional[float], 2025-05-07T20:32:43.2033530Z contiguous: bool, 2025-05-07T20:32:43.2033613Z compiled: bool, 2025-05-07T20:32:43.2033694Z ) -> None: 2025-05-07T20:32:43.2033786Z torch.manual_seed(2025) 2025-05-07T20:32:43.2033855Z 2025-05-07T20:32:43.2034023Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2034093Z 2025-05-07T20:32:43.2034187Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2034309Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2034399Z x = x_sign * x_clamp 2025-05-07T20:32:43.2034483Z x0 = x[:, :D] 2025-05-07T20:32:43.2034561Z x1 = x[:, D:] 2025-05-07T20:32:43.2034633Z 2025-05-07T20:32:43.2034889Z if contiguous: 2025-05-07T20:32:43.2034981Z x0 = x0.contiguous() 2025-05-07T20:32:43.2035069Z x1 = x1.contiguous() 2025-05-07T20:32:43.2035146Z 2025-05-07T20:32:43.2035234Z if scale_ub is not None: 2025-05-07T20:32:43.2035334Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2035478Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2035555Z ) 2025-05-07T20:32:43.2035635Z else: 2025-05-07T20:32:43.2035726Z scale_ub_tensor = None 2025-05-07T20:32:43.2035799Z 2025-05-07T20:32:43.2035933Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2036018Z op = silu_mul_quant 2025-05-07T20:32:43.2036099Z if compiled: 2025-05-07T20:32:43.2036205Z op = torch.compile(op) 2025-05-07T20:32:43.2036306Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2036379Z 2025-05-07T20:32:43.2036478Z y_fp8, y_scale = fn() 2025-05-07T20:32:43.2036596Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:43.2036668Z 2025-05-07T20:32:43.2036804Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2036902Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:43.2037005Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:43.2037120Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:43.2037255Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.2037331Z 2025-05-07T20:32:43.2037427Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:43.2037432Z 2025-05-07T20:32:43.2037525Z moe/activation_test.py:126: 2025-05-07T20:32:43.2037653Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2037838Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:43.2037967Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.2038527Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:43.2038624Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:43.2038984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2039200Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2039561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:43.2039819Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.2040426Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:43.2040639Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:43.2040986Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:43.2041059Z fn() 2025-05-07T20:32:43.2041461Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:43.2041539Z self.fn.run( 2025-05-07T20:32:43.2041871Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2041966Z kernel = self.compile( 2025-05-07T20:32:43.2042341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2042514Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2042637Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2042645Z 2025-05-07T20:32:43.2042843Z self = 2025-05-07T20:32:43.2043747Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2044239Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13f8b5a520>} 2025-05-07T20:32:43.2044976Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2045162Z context = 2025-05-07T20:32:43.2045170Z 2025-05-07T20:32:43.2045334Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2045596Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2045698Z module_map=module_map) 2025-05-07T20:32:43.2045860Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2045955Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:43.2046028Z E ^ 2025-05-07T20:32:43.2046382Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2046386Z 2025-05-07T20:32:43.2046795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2046799Z 2025-05-07T20:32:43.2046902Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2047119Z self=, 2025-05-07T20:32:43.2047318Z T=1, 2025-05-07T20:32:43.2047397Z D=5120, 2025-05-07T20:32:43.2047477Z scale_ub=1200.0, 2025-05-07T20:32:43.2047563Z contiguous=True, 2025-05-07T20:32:43.2047650Z compiled=True, 2025-05-07T20:32:43.2047716Z ) 2025-05-07T20:32:43.2047933Z self = 2025-05-07T20:32:43.2048100Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.2048105Z 2025-05-07T20:32:43.2048180Z @given( 2025-05-07T20:32:43.2048305Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2048400Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2048510Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2048632Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2048743Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2048816Z ) 2025-05-07T20:32:43.2049061Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2049159Z def test_silu_mul_quant( 2025-05-07T20:32:43.2049239Z self, 2025-05-07T20:32:43.2049323Z T: int, 2025-05-07T20:32:43.2049395Z D: int, 2025-05-07T20:32:43.2049496Z scale_ub: Optional[float], 2025-05-07T20:32:43.2049585Z contiguous: bool, 2025-05-07T20:32:43.2049665Z compiled: bool, 2025-05-07T20:32:43.2049748Z ) -> None: 2025-05-07T20:32:43.2049840Z torch.manual_seed(2025) 2025-05-07T20:32:43.2049910Z 2025-05-07T20:32:43.2050079Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2050153Z 2025-05-07T20:32:43.2050243Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2050371Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2050460Z x = x_sign * x_clamp 2025-05-07T20:32:43.2050538Z x0 = x[:, :D] 2025-05-07T20:32:43.2050619Z x1 = x[:, D:] 2025-05-07T20:32:43.2050695Z 2025-05-07T20:32:43.2050782Z if contiguous: 2025-05-07T20:32:43.2050869Z x0 = x0.contiguous() 2025-05-07T20:32:43.2051035Z x1 = x1.contiguous() 2025-05-07T20:32:43.2051112Z 2025-05-07T20:32:43.2051197Z if scale_ub is not None: 2025-05-07T20:32:43.2051302Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2051437Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2051510Z ) 2025-05-07T20:32:43.2051582Z else: 2025-05-07T20:32:43.2051674Z scale_ub_tensor = None 2025-05-07T20:32:43.2051743Z 2025-05-07T20:32:43.2051868Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2051959Z op = silu_mul_quant 2025-05-07T20:32:43.2052042Z if compiled: 2025-05-07T20:32:43.2052142Z op = torch.compile(op) 2025-05-07T20:32:43.2052246Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2052312Z 2025-05-07T20:32:43.2052407Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2052411Z 2025-05-07T20:32:43.2052503Z moe/activation_test.py:117: 2025-05-07T20:32:43.2052633Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2052735Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2052831Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2053194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2053287Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.2053771Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2053870Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2054223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2054441Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2054862Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2054957Z kernel = self.compile( 2025-05-07T20:32:43.2055339Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2055507Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2055629Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2055634Z 2025-05-07T20:32:43.2055836Z self = 2025-05-07T20:32:43.2056597Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2057105Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c5d0f1a0>} 2025-05-07T20:32:43.2057889Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2058073Z context = 2025-05-07T20:32:43.2058078Z 2025-05-07T20:32:43.2058242Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2058498Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2058606Z module_map=module_map) 2025-05-07T20:32:43.2058762Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2058857Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2058943Z E ^ 2025-05-07T20:32:43.2059290Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2059372Z 2025-05-07T20:32:43.2059780Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2059790Z 2025-05-07T20:32:43.2059889Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2060104Z self=, 2025-05-07T20:32:43.2060183Z T=1, 2025-05-07T20:32:43.2060256Z D=5120, 2025-05-07T20:32:43.2060336Z scale_ub=None, 2025-05-07T20:32:43.2060427Z contiguous=False, 2025-05-07T20:32:43.2060505Z compiled=True, 2025-05-07T20:32:43.2060570Z ) 2025-05-07T20:32:43.2060792Z self = 2025-05-07T20:32:43.2060951Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:43.2060959Z 2025-05-07T20:32:43.2061040Z @given( 2025-05-07T20:32:43.2061154Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2061255Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2061369Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2061481Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2061589Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2065747Z ) 2025-05-07T20:32:43.2066011Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2066106Z def test_silu_mul_quant( 2025-05-07T20:32:43.2066182Z self, 2025-05-07T20:32:43.2066257Z T: int, 2025-05-07T20:32:43.2066330Z D: int, 2025-05-07T20:32:43.2066428Z scale_ub: Optional[float], 2025-05-07T20:32:43.2066513Z contiguous: bool, 2025-05-07T20:32:43.2066596Z compiled: bool, 2025-05-07T20:32:43.2066671Z ) -> None: 2025-05-07T20:32:43.2066888Z torch.manual_seed(2025) 2025-05-07T20:32:43.2066961Z 2025-05-07T20:32:43.2067126Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2067201Z 2025-05-07T20:32:43.2067296Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2067481Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2067566Z x = x_sign * x_clamp 2025-05-07T20:32:43.2067647Z x0 = x[:, :D] 2025-05-07T20:32:43.2067719Z x1 = x[:, D:] 2025-05-07T20:32:43.2067786Z 2025-05-07T20:32:43.2067872Z if contiguous: 2025-05-07T20:32:43.2067957Z x0 = x0.contiguous() 2025-05-07T20:32:43.2068043Z x1 = x1.contiguous() 2025-05-07T20:32:43.2068118Z 2025-05-07T20:32:43.2068206Z if scale_ub is not None: 2025-05-07T20:32:43.2068311Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2068442Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2068519Z ) 2025-05-07T20:32:43.2068591Z else: 2025-05-07T20:32:43.2068679Z scale_ub_tensor = None 2025-05-07T20:32:43.2068743Z 2025-05-07T20:32:43.2068878Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2068963Z op = silu_mul_quant 2025-05-07T20:32:43.2069044Z if compiled: 2025-05-07T20:32:43.2069142Z op = torch.compile(op) 2025-05-07T20:32:43.2069243Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2069314Z 2025-05-07T20:32:43.2069406Z y_fp8, y_scale = fn() 2025-05-07T20:32:43.2069521Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:43.2069592Z 2025-05-07T20:32:43.2069722Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2069820Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:43.2069918Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:43.2070031Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:43.2070168Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.2070243Z 2025-05-07T20:32:43.2070419Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:32:43.2070425Z 2025-05-07T20:32:43.2070521Z moe/activation_test.py:126: 2025-05-07T20:32:43.2070655Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2070755Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:43.2070889Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:43.2071442Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:43.2071536Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:43.2071899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2072115Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2072488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:43.2072742Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:43.2073113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:43.2073278Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:43.2073614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:43.2073688Z fn() 2025-05-07T20:32:43.2074087Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:43.2074166Z self.fn.run( 2025-05-07T20:32:43.2074502Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2074672Z kernel = self.compile( 2025-05-07T20:32:43.2075053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2075225Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2075348Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2075353Z 2025-05-07T20:32:43.2075555Z self = 2025-05-07T20:32:43.2076315Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2076808Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c64782c0>} 2025-05-07T20:32:43.2077602Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2077785Z context = 2025-05-07T20:32:43.2077790Z 2025-05-07T20:32:43.2077950Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2078203Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2078306Z module_map=module_map) 2025-05-07T20:32:43.2078464Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2078560Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:43.2078633Z E ^ 2025-05-07T20:32:43.2078982Z E ValueError("type fp8e4nv not supported in this architecture. 
Hypothesis continued drawing examples, and every remaining draw failed in exactly the same way: identical test body, identical Triton frames, and the identical error

E       ValueError("type fp8e4nv not supported in this architecture. 
E       The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Only the drawn parameters and the kernel that first hits the error vary:

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
    -> CompilationError in _fbgemm_silu_mul_quant (moe/activation.py:80, reached from fn() at moe/activation_test.py:117)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
    -> CompilationError in _fbgemm_silu_mul_quant (through torch/_dynamo/eval_frame.py:678, since compiled=True)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
    -> CompilationError in _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
    -> CompilationError in _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
    -> CompilationError in _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
    -> CompilationError in _fbgemm_silu_mul_quant (through torch/_dynamo/eval_frame.py:678)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
    -> CompilationError in _fbgemm_silu_mul_quant (through torch/_dynamo/eval_frame.py:678)
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)
    -> here fn() itself returned, and the CompilationError surfaced in the reference path instead:
       ref_fn() at moe/activation_test.py:126 -> triton_quantize_fp8_row (fp8_gemm.py:2370) -> _kernel_quantize_fp8_row
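Note that in the last example the failure moved into the reference path: ref_fn() also lowers through Triton (_kernel_quantize_fp8_row), so on this GPU even the baseline cannot compile. Rowwise FP8 quantization of the kind the test dequantizes with y_fp8.to(torch.float32) * y_scale[:, None] can be emulated in plain PyTorch; a rough device-independent stand-in (my sketch under an assumed scale convention, not FBGEMM's actual API) looks like:

    from typing import Optional, Tuple

    import torch

    def quantize_fp8_row_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row absolute maximum, optionally clamped from above by scale_ub.
        row_max = y.abs().amax(dim=-1).float()
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub.to(row_max.dtype))
        # Scale each row into the E4M3 range; dequantization is fp8 * scale[:, None].
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
        scale = row_max.clamp(min=1e-12) / fp8_max
        y_fp8 = (y.float() / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

This sidesteps Triton entirely, though on a pre-SM89 device the cast to torch.float8_e4m3fn is still a software conversion, so it can only replace the reference, not the kernel under test.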
The remaining draws failed in the fused kernel again:

Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
    -> CompilationError in _fbgemm_silu_mul_quant (through torch/_dynamo/eval_frame.py:678)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
    -> CompilationError in _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)

moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2223874Z 2025-05-07T20:32:43.2224281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2224286Z 2025-05-07T20:32:43.2224388Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2224604Z self=, 2025-05-07T20:32:43.2224676Z T=2048, 2025-05-07T20:32:43.2224753Z D=7168, 2025-05-07T20:32:43.2224835Z scale_ub=1200.0, 2025-05-07T20:32:43.2224916Z contiguous=False, 2025-05-07T20:32:43.2224997Z compiled=True, 2025-05-07T20:32:43.2225065Z ) 2025-05-07T20:32:43.2225275Z self = 2025-05-07T20:32:43.2225445Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:43.2225452Z 2025-05-07T20:32:43.2225525Z @given( 2025-05-07T20:32:43.2225642Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2225742Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2225852Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2225965Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2226073Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2226143Z ) 2025-05-07T20:32:43.2226382Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2226468Z def test_silu_mul_quant( 2025-05-07T20:32:43.2226543Z self, 2025-05-07T20:32:43.2226614Z T: int, 2025-05-07T20:32:43.2226685Z D: int, 2025-05-07T20:32:43.2226779Z scale_ub: Optional[float], 2025-05-07T20:32:43.2226863Z contiguous: bool, 2025-05-07T20:32:43.2226942Z compiled: bool, 2025-05-07T20:32:43.2227100Z ) -> None: 2025-05-07T20:32:43.2227189Z torch.manual_seed(2025) 2025-05-07T20:32:43.2227259Z 2025-05-07T20:32:43.2227483Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2227551Z 2025-05-07T20:32:43.2227638Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2227759Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2227844Z x = x_sign * x_clamp 2025-05-07T20:32:43.2227916Z x0 = x[:, :D] 2025-05-07T20:32:43.2227995Z x1 = x[:, D:] 2025-05-07T20:32:43.2228063Z 2025-05-07T20:32:43.2228145Z if contiguous: 2025-05-07T20:32:43.2228230Z x0 = x0.contiguous() 2025-05-07T20:32:43.2228313Z x1 = x1.contiguous() 2025-05-07T20:32:43.2228385Z 2025-05-07T20:32:43.2228471Z if scale_ub is not None: 2025-05-07T20:32:43.2228567Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2228698Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2228775Z ) 2025-05-07T20:32:43.2228846Z else: 2025-05-07T20:32:43.2228937Z scale_ub_tensor = None 2025-05-07T20:32:43.2229007Z 2025-05-07T20:32:43.2229134Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2229218Z op = silu_mul_quant 2025-05-07T20:32:43.2229297Z if compiled: 2025-05-07T20:32:43.2229393Z op = torch.compile(op) 2025-05-07T20:32:43.2229490Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2229558Z 2025-05-07T20:32:43.2229643Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2229647Z 2025-05-07T20:32:43.2229738Z moe/activation_test.py:117: 2025-05-07T20:32:43.2229861Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2229958Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2230052Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2230414Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2230510Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.2231073Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2231170Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2231520Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2231734Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2232068Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2232153Z kernel = self.compile( 2025-05-07T20:32:43.2232534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2232699Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2232822Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2232827Z 2025-05-07T20:32:43.2233035Z self = 2025-05-07T20:32:43.2233794Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2234290Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c63428e0>} 2025-05-07T20:32:43.2235020Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2235280Z context = 2025-05-07T20:32:43.2235284Z 2025-05-07T20:32:43.2235443Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2235697Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2235802Z module_map=module_map) 2025-05-07T20:32:43.2235958Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2236051Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2236127Z E ^ 2025-05-07T20:32:43.2236470Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2236474Z 2025-05-07T20:32:43.2236875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2236884Z 2025-05-07T20:32:43.2236981Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2237200Z self=, 2025-05-07T20:32:43.2237276Z T=1, 2025-05-07T20:32:43.2237347Z D=5120, 2025-05-07T20:32:43.2237429Z scale_ub=None, 2025-05-07T20:32:43.2237511Z contiguous=False, 2025-05-07T20:32:43.2237591Z compiled=False, 2025-05-07T20:32:43.2237659Z ) 2025-05-07T20:32:43.2237870Z self = 2025-05-07T20:32:43.2238031Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.2238036Z 2025-05-07T20:32:43.2238107Z @given( 2025-05-07T20:32:43.2238220Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2238315Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2238424Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2238533Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2238639Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2238720Z ) 2025-05-07T20:32:43.2238955Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2239119Z def test_silu_mul_quant( 2025-05-07T20:32:43.2239196Z self, 2025-05-07T20:32:43.2239271Z T: int, 2025-05-07T20:32:43.2239340Z D: int, 2025-05-07T20:32:43.2239437Z scale_ub: Optional[float], 2025-05-07T20:32:43.2239518Z contiguous: bool, 2025-05-07T20:32:43.2239599Z compiled: bool, 2025-05-07T20:32:43.2239673Z ) -> None: 2025-05-07T20:32:43.2239759Z torch.manual_seed(2025) 2025-05-07T20:32:43.2239833Z 2025-05-07T20:32:43.2239991Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2240249Z 2025-05-07T20:32:43.2240388Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2240562Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2240690Z x = x_sign * x_clamp 2025-05-07T20:32:43.2240812Z x0 = x[:, :D] 2025-05-07T20:32:43.2240919Z x1 = x[:, D:] 2025-05-07T20:32:43.2241014Z 2025-05-07T20:32:43.2241133Z if contiguous: 2025-05-07T20:32:43.2241254Z x0 = x0.contiguous() 2025-05-07T20:32:43.2241343Z x1 = x1.contiguous() 2025-05-07T20:32:43.2241412Z 2025-05-07T20:32:43.2241497Z if scale_ub is not None: 2025-05-07T20:32:43.2241599Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2241727Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2241795Z ) 2025-05-07T20:32:43.2241870Z else: 2025-05-07T20:32:43.2241958Z scale_ub_tensor = None 2025-05-07T20:32:43.2242023Z 2025-05-07T20:32:43.2242151Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2242238Z op = silu_mul_quant 2025-05-07T20:32:43.2242317Z if compiled: 2025-05-07T20:32:43.2242419Z op = torch.compile(op) 2025-05-07T20:32:43.2242672Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2242743Z 2025-05-07T20:32:43.2242828Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2242833Z 2025-05-07T20:32:43.2242927Z moe/activation_test.py:117: 2025-05-07T20:32:43.2243051Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2243144Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2243236Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2243726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2243818Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2244172Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2244387Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2244719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2244813Z kernel = self.compile( 2025-05-07T20:32:43.2245212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2245379Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2245503Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2245507Z 2025-05-07T20:32:43.2245703Z self = 2025-05-07T20:32:43.2246466Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2246953Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c63434c0>} 2025-05-07T20:32:43.2247830Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2248015Z context = 2025-05-07T20:32:43.2248020Z 2025-05-07T20:32:43.2248178Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2248433Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2248534Z module_map=module_map) 2025-05-07T20:32:43.2248690Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2248784Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2248861Z E ^ 2025-05-07T20:32:43.2249215Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2249225Z 2025-05-07T20:32:43.2249659Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2249664Z 2025-05-07T20:32:43.2249759Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2249977Z self=, 2025-05-07T20:32:43.2250049Z T=4096, 2025-05-07T20:32:43.2250125Z D=7168, 2025-05-07T20:32:43.2250204Z scale_ub=1200.0, 2025-05-07T20:32:43.2250282Z contiguous=False, 2025-05-07T20:32:43.2250359Z compiled=False, 2025-05-07T20:32:43.2250426Z ) 2025-05-07T20:32:43.2250637Z self = 2025-05-07T20:32:43.2250808Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:43.2250813Z 2025-05-07T20:32:43.2250885Z @given( 2025-05-07T20:32:43.2250996Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2251171Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2251285Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2251397Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2251502Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2251572Z ) 2025-05-07T20:32:43.2251808Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2251895Z def test_silu_mul_quant( 2025-05-07T20:32:43.2251965Z self, 2025-05-07T20:32:43.2252039Z T: int, 2025-05-07T20:32:43.2252108Z D: int, 2025-05-07T20:32:43.2252201Z scale_ub: Optional[float], 2025-05-07T20:32:43.2252286Z contiguous: bool, 2025-05-07T20:32:43.2252365Z compiled: bool, 2025-05-07T20:32:43.2252440Z ) -> None: 2025-05-07T20:32:43.2252530Z torch.manual_seed(2025) 2025-05-07T20:32:43.2252595Z 2025-05-07T20:32:43.2252762Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2252833Z 2025-05-07T20:32:43.2252918Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2253043Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2253125Z x = x_sign * x_clamp 2025-05-07T20:32:43.2253199Z x0 = x[:, :D] 2025-05-07T20:32:43.2253274Z x1 = x[:, D:] 2025-05-07T20:32:43.2253339Z 2025-05-07T20:32:43.2253416Z if contiguous: 2025-05-07T20:32:43.2253506Z x0 = x0.contiguous() 2025-05-07T20:32:43.2253590Z x1 = x1.contiguous() 2025-05-07T20:32:43.2253655Z 2025-05-07T20:32:43.2253744Z if scale_ub is not None: 2025-05-07T20:32:43.2253844Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2253971Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2254045Z ) 2025-05-07T20:32:43.2254116Z else: 2025-05-07T20:32:43.2254207Z scale_ub_tensor = None 2025-05-07T20:32:43.2254277Z 2025-05-07T20:32:43.2254397Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2254564Z op = silu_mul_quant 2025-05-07T20:32:43.2254645Z if compiled: 2025-05-07T20:32:43.2254738Z op = torch.compile(op) 2025-05-07T20:32:43.2254839Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2254902Z 2025-05-07T20:32:43.2254985Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2254989Z 2025-05-07T20:32:43.2255082Z moe/activation_test.py:117: 2025-05-07T20:32:43.2255203Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2255301Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2255395Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2255877Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:43.2255976Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2256326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2256547Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2256884Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2256974Z kernel = self.compile( 2025-05-07T20:32:43.2257376Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2257542Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2257662Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2257666Z 2025-05-07T20:32:43.2257865Z self = 2025-05-07T20:32:43.2258632Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2259204Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c53ad080>} 2025-05-07T20:32:43.2259934Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2260116Z context = 2025-05-07T20:32:43.2260123Z 2025-05-07T20:32:43.2260278Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2260529Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2260638Z module_map=module_map) 2025-05-07T20:32:43.2260792Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2260888Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2260962Z E ^ 2025-05-07T20:32:43.2261306Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2261310Z 2025-05-07T20:32:43.2261744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2261748Z 2025-05-07T20:32:43.2261846Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2262059Z self=, 2025-05-07T20:32:43.2262137Z T=16384, 2025-05-07T20:32:43.2262210Z D=7168, 2025-05-07T20:32:43.2262289Z scale_ub=None, 2025-05-07T20:32:43.2262372Z contiguous=True, 2025-05-07T20:32:43.2262447Z compiled=True, 2025-05-07T20:32:43.2262518Z ) 2025-05-07T20:32:43.2262730Z self = 2025-05-07T20:32:43.2262970Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.2262975Z 2025-05-07T20:32:43.2263051Z @given( 2025-05-07T20:32:43.2263165Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2263262Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2263379Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2263492Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2263601Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2263674Z ) 2025-05-07T20:32:43.2263909Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2264001Z def test_silu_mul_quant( 2025-05-07T20:32:43.2264075Z self, 2025-05-07T20:32:43.2264149Z T: int, 2025-05-07T20:32:43.2264229Z D: int, 2025-05-07T20:32:43.2264320Z scale_ub: Optional[float], 2025-05-07T20:32:43.2264404Z contiguous: bool, 2025-05-07T20:32:43.2264485Z compiled: bool, 2025-05-07T20:32:43.2264563Z ) -> None: 2025-05-07T20:32:43.2264649Z torch.manual_seed(2025) 2025-05-07T20:32:43.2264722Z 2025-05-07T20:32:43.2264881Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2264952Z 2025-05-07T20:32:43.2265040Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2265157Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2265242Z x = x_sign * x_clamp 2025-05-07T20:32:43.2265324Z x0 = x[:, :D] 2025-05-07T20:32:43.2265396Z x1 = x[:, D:] 2025-05-07T20:32:43.2265465Z 2025-05-07T20:32:43.2265541Z if contiguous: 2025-05-07T20:32:43.2265624Z x0 = x0.contiguous() 2025-05-07T20:32:43.2265707Z x1 = x1.contiguous() 2025-05-07T20:32:43.2265772Z 2025-05-07T20:32:43.2265936Z if scale_ub is not None: 2025-05-07T20:32:43.2266040Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2266174Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2266242Z ) 2025-05-07T20:32:43.2266319Z else: 2025-05-07T20:32:43.2266408Z scale_ub_tensor = None 2025-05-07T20:32:43.2266475Z 2025-05-07T20:32:43.2266600Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2266682Z op = silu_mul_quant 2025-05-07T20:32:43.2266764Z if compiled: 2025-05-07T20:32:43.2266855Z op = torch.compile(op) 2025-05-07T20:32:43.2266956Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2267026Z 2025-05-07T20:32:43.2267109Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2267114Z 2025-05-07T20:32:43.2267205Z moe/activation_test.py:117: 2025-05-07T20:32:43.2267332Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2267484Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2267577Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2267945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2268033Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.2268519Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2268610Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2268959Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2269179Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2269511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2269599Z kernel = self.compile( 2025-05-07T20:32:43.2269979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2270227Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2270350Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2270354Z 2025-05-07T20:32:43.2270552Z self = 2025-05-07T20:32:43.2271312Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2271802Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c53ae2a0>} 2025-05-07T20:32:43.2272530Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2272726Z context = 2025-05-07T20:32:43.2272731Z 2025-05-07T20:32:43.2272886Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2273146Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2273247Z module_map=module_map) 2025-05-07T20:32:43.2273400Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2273493Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2273563Z E ^ 2025-05-07T20:32:43.2273907Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2273912Z 2025-05-07T20:32:43.2274319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2274400Z 2025-05-07T20:32:43.2274503Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2274726Z self=, 2025-05-07T20:32:43.2274799Z T=4096, 2025-05-07T20:32:43.2274867Z D=5120, 2025-05-07T20:32:43.2274946Z scale_ub=None, 2025-05-07T20:32:43.2275026Z contiguous=False, 2025-05-07T20:32:43.2275100Z compiled=True, 2025-05-07T20:32:43.2275170Z ) 2025-05-07T20:32:43.2275379Z self = 2025-05-07T20:32:43.2275543Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:43.2275551Z 2025-05-07T20:32:43.2275621Z @given( 2025-05-07T20:32:43.2275731Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2275828Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2275939Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2276051Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2276169Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2276240Z ) 2025-05-07T20:32:43.2276474Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2276564Z def test_silu_mul_quant( 2025-05-07T20:32:43.2276633Z self, 2025-05-07T20:32:43.2276704Z T: int, 2025-05-07T20:32:43.2276778Z D: int, 2025-05-07T20:32:43.2276868Z scale_ub: Optional[float], 2025-05-07T20:32:43.2276954Z contiguous: bool, 2025-05-07T20:32:43.2277034Z compiled: bool, 2025-05-07T20:32:43.2277104Z ) -> None: 2025-05-07T20:32:43.2277194Z torch.manual_seed(2025) 2025-05-07T20:32:43.2277260Z 2025-05-07T20:32:43.2277430Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2277519Z 2025-05-07T20:32:43.2277617Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2277752Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2277838Z x = x_sign * x_clamp 2025-05-07T20:32:43.2278017Z x0 = x[:, :D] 2025-05-07T20:32:43.2278091Z x1 = x[:, D:] 2025-05-07T20:32:43.2278158Z 2025-05-07T20:32:43.2278237Z if contiguous: 2025-05-07T20:32:43.2278326Z x0 = x0.contiguous() 2025-05-07T20:32:43.2278408Z x1 = x1.contiguous() 2025-05-07T20:32:43.2278477Z 2025-05-07T20:32:43.2278563Z if scale_ub is not None: 2025-05-07T20:32:43.2278664Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2278791Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2278862Z ) 2025-05-07T20:32:43.2278932Z else: 2025-05-07T20:32:43.2279019Z scale_ub_tensor = None 2025-05-07T20:32:43.2279094Z 2025-05-07T20:32:43.2279216Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2279305Z op = silu_mul_quant 2025-05-07T20:32:43.2279386Z if compiled: 2025-05-07T20:32:43.2279479Z op = torch.compile(op) 2025-05-07T20:32:43.2279586Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2279652Z 2025-05-07T20:32:43.2279735Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2279739Z 2025-05-07T20:32:43.2279831Z moe/activation_test.py:117: 2025-05-07T20:32:43.2279953Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2280044Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2280137Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2280494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2280583Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.2281069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2281242Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2281601Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2281817Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2282149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2282245Z kernel = self.compile( 2025-05-07T20:32:43.2282621Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2282791Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2282910Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2282914Z 2025-05-07T20:32:43.2283108Z self = 2025-05-07T20:32:43.2283880Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2284368Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c53aefc0>} 2025-05-07T20:32:43.2285099Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2285280Z context = 2025-05-07T20:32:43.2285285Z 2025-05-07T20:32:43.2285441Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2285702Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2285806Z module_map=module_map) 2025-05-07T20:32:43.2286041Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2286135Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2286208Z E ^ 2025-05-07T20:32:43.2286561Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2286565Z 2025-05-07T20:32:43.2286971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2286975Z 2025-05-07T20:32:43.2287074Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2287288Z self=, 2025-05-07T20:32:43.2287358Z T=4096, 2025-05-07T20:32:43.2287430Z D=5120, 2025-05-07T20:32:43.2287506Z scale_ub=1200.0, 2025-05-07T20:32:43.2287589Z contiguous=False, 2025-05-07T20:32:43.2287670Z compiled=False, 2025-05-07T20:32:43.2287735Z ) 2025-05-07T20:32:43.2287951Z self = 2025-05-07T20:32:43.2288124Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:43.2288128Z 2025-05-07T20:32:43.2288199Z @given( 2025-05-07T20:32:43.2288313Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2288404Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2288512Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2288624Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2288731Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2288798Z ) 2025-05-07T20:32:43.2289036Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2289122Z def test_silu_mul_quant( 2025-05-07T20:32:43.2289194Z self, 2025-05-07T20:32:43.2289415Z T: int, 2025-05-07T20:32:43.2289483Z D: int, 2025-05-07T20:32:43.2289576Z scale_ub: Optional[float], 2025-05-07T20:32:43.2289668Z contiguous: bool, 2025-05-07T20:32:43.2289749Z compiled: bool, 2025-05-07T20:32:43.2289822Z ) -> None: 2025-05-07T20:32:43.2289909Z torch.manual_seed(2025) 2025-05-07T20:32:43.2289975Z 2025-05-07T20:32:43.2290139Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2290208Z 2025-05-07T20:32:43.2290293Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2290415Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2290496Z x = x_sign * x_clamp 2025-05-07T20:32:43.2290571Z x0 = x[:, :D] 2025-05-07T20:32:43.2290649Z x1 = x[:, D:] 2025-05-07T20:32:43.2290714Z 2025-05-07T20:32:43.2290790Z if contiguous: 2025-05-07T20:32:43.2290879Z x0 = x0.contiguous() 2025-05-07T20:32:43.2290959Z x1 = x1.contiguous() 2025-05-07T20:32:43.2291033Z 2025-05-07T20:32:43.2291114Z if scale_ub is not None: 2025-05-07T20:32:43.2291220Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2291350Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2291422Z ) 2025-05-07T20:32:43.2291493Z else: 2025-05-07T20:32:43.2291583Z scale_ub_tensor = None 2025-05-07T20:32:43.2291647Z 2025-05-07T20:32:43.2291770Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2291856Z op = silu_mul_quant 2025-05-07T20:32:43.2291936Z if compiled: 2025-05-07T20:32:43.2292029Z op = torch.compile(op) 2025-05-07T20:32:43.2292131Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2292200Z 2025-05-07T20:32:43.2292290Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2292294Z 2025-05-07T20:32:43.2292385Z moe/activation_test.py:117: 2025-05-07T20:32:43.2292512Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2292607Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2292777Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2293266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:43.2293360Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2293714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2293933Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2294261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2294347Z kernel = self.compile( 2025-05-07T20:32:43.2294745Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2294917Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2295041Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2295048Z 2025-05-07T20:32:43.2295243Z self = 2025-05-07T20:32:43.2296006Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2296493Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4ca8360>} 2025-05-07T20:32:43.2297224Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2297490Z context = 2025-05-07T20:32:43.2297495Z 2025-05-07T20:32:43.2297658Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2297910Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2298016Z module_map=module_map) 2025-05-07T20:32:43.2298168Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2298258Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2298335Z E ^ 2025-05-07T20:32:43.2298678Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2298683Z 2025-05-07T20:32:43.2299107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2299117Z 2025-05-07T20:32:43.2299213Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2299425Z self=, 2025-05-07T20:32:43.2299505Z T=4096, 2025-05-07T20:32:43.2299576Z D=5120, 2025-05-07T20:32:43.2299654Z scale_ub=1200.0, 2025-05-07T20:32:43.2299735Z contiguous=False, 2025-05-07T20:32:43.2299811Z compiled=True, 2025-05-07T20:32:43.2299878Z ) 2025-05-07T20:32:43.2300086Z self = 2025-05-07T20:32:43.2300253Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:43.2300258Z 2025-05-07T20:32:43.2300334Z @given( 2025-05-07T20:32:43.2300445Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2300539Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2300651Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2300763Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2300877Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2304503Z ) 2025-05-07T20:32:43.2304861Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2304955Z def test_silu_mul_quant( 2025-05-07T20:32:43.2305028Z self, 2025-05-07T20:32:43.2305100Z T: int, 2025-05-07T20:32:43.2305172Z D: int, 2025-05-07T20:32:43.2305264Z scale_ub: Optional[float], 2025-05-07T20:32:43.2305349Z contiguous: bool, 2025-05-07T20:32:43.2305430Z compiled: bool, 2025-05-07T20:32:43.2305503Z ) -> None: 2025-05-07T20:32:43.2305591Z torch.manual_seed(2025) 2025-05-07T20:32:43.2305662Z 2025-05-07T20:32:43.2305825Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2305890Z 2025-05-07T20:32:43.2305984Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2306105Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2306203Z x = x_sign * x_clamp 2025-05-07T20:32:43.2306274Z x0 = x[:, :D] 2025-05-07T20:32:43.2306348Z x1 = x[:, D:] 2025-05-07T20:32:43.2306427Z 2025-05-07T20:32:43.2306503Z if contiguous: 2025-05-07T20:32:43.2306585Z x0 = x0.contiguous() 2025-05-07T20:32:43.2306671Z x1 = x1.contiguous() 2025-05-07T20:32:43.2306739Z 2025-05-07T20:32:43.2306826Z if scale_ub is not None: 2025-05-07T20:32:43.2306930Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2307059Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2307125Z ) 2025-05-07T20:32:43.2307200Z else: 2025-05-07T20:32:43.2307286Z scale_ub_tensor = None 2025-05-07T20:32:43.2307351Z 2025-05-07T20:32:43.2307560Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2307644Z op = silu_mul_quant 2025-05-07T20:32:43.2307727Z if compiled: 2025-05-07T20:32:43.2307931Z op = torch.compile(op) 2025-05-07T20:32:43.2308033Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2308102Z 2025-05-07T20:32:43.2308192Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2308197Z 2025-05-07T20:32:43.2308290Z moe/activation_test.py:117: 2025-05-07T20:32:43.2308417Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2308513Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2308612Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2308974Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2309060Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.2309549Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2309641Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2310000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2310226Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2310561Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2310651Z kernel = self.compile( 2025-05-07T20:32:43.2311046Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2311213Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2311335Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2311340Z 2025-05-07T20:32:43.2311537Z self = 2025-05-07T20:32:43.2312299Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2312876Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4ca94e0>} 2025-05-07T20:32:43.2313608Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2313794Z context = 2025-05-07T20:32:43.2313798Z 2025-05-07T20:32:43.2313955Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2314211Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2314312Z module_map=module_map) 2025-05-07T20:32:43.2314471Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2314569Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2314645Z E ^ 2025-05-07T20:32:43.2314992Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2315001Z 2025-05-07T20:32:43.2315431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2315435Z 2025-05-07T20:32:43.2315532Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2315751Z self=, 2025-05-07T20:32:43.2315824Z T=2048, 2025-05-07T20:32:43.2315893Z D=7168, 2025-05-07T20:32:43.2315973Z scale_ub=1200.0, 2025-05-07T20:32:43.2316053Z contiguous=False, 2025-05-07T20:32:43.2316130Z compiled=False, 2025-05-07T20:32:43.2316197Z ) 2025-05-07T20:32:43.2316489Z self = 2025-05-07T20:32:43.2316661Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:43.2316671Z 2025-05-07T20:32:43.2316741Z @given( 2025-05-07T20:32:43.2316855Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2316950Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2317059Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2317170Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2317279Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2317348Z ) 2025-05-07T20:32:43.2317585Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2317673Z def test_silu_mul_quant( 2025-05-07T20:32:43.2317742Z self, 2025-05-07T20:32:43.2317814Z T: int, 2025-05-07T20:32:43.2317884Z D: int, 2025-05-07T20:32:43.2317976Z scale_ub: Optional[float], 2025-05-07T20:32:43.2318067Z contiguous: bool, 2025-05-07T20:32:43.2318148Z compiled: bool, 2025-05-07T20:32:43.2318218Z ) -> None: 2025-05-07T20:32:43.2318318Z torch.manual_seed(2025) 2025-05-07T20:32:43.2318388Z 2025-05-07T20:32:43.2318550Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2318623Z 2025-05-07T20:32:43.2318708Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2318825Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2318915Z x = x_sign * x_clamp 2025-05-07T20:32:43.2318989Z x0 = x[:, :D] 2025-05-07T20:32:43.2319068Z x1 = x[:, D:] 2025-05-07T20:32:43.2319136Z 2025-05-07T20:32:43.2319215Z if contiguous: 2025-05-07T20:32:43.2319303Z x0 = x0.contiguous() 2025-05-07T20:32:43.2319386Z x1 = x1.contiguous() 2025-05-07T20:32:43.2319457Z 2025-05-07T20:32:43.2319546Z if scale_ub is not None: 2025-05-07T20:32:43.2319651Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2319778Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2319936Z ) 2025-05-07T20:32:43.2320010Z else: 2025-05-07T20:32:43.2320097Z scale_ub_tensor = None 2025-05-07T20:32:43.2320168Z 2025-05-07T20:32:43.2320292Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2320385Z op = silu_mul_quant 2025-05-07T20:32:43.2320466Z if compiled: 2025-05-07T20:32:43.2320559Z op = torch.compile(op) 2025-05-07T20:32:43.2320664Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2320729Z 2025-05-07T20:32:43.2320816Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2320820Z 2025-05-07T20:32:43.2320912Z moe/activation_test.py:117: 2025-05-07T20:32:43.2321040Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2321136Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2321238Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2321733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:43.2321827Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2322179Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2322394Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2322731Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2322818Z kernel = self.compile( 2025-05-07T20:32:43.2323195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2323363Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2323564Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2323568Z 2025-05-07T20:32:43.2323771Z self = 2025-05-07T20:32:43.2324533Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2325029Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4ca9f80>} 2025-05-07T20:32:43.2325759Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2325942Z context = 2025-05-07T20:32:43.2325952Z 2025-05-07T20:32:43.2326113Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2326372Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2326478Z module_map=module_map) 2025-05-07T20:32:43.2326632Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2326726Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2326800Z E ^ 2025-05-07T20:32:43.2327144Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2327149Z 2025-05-07T20:32:43.2327558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2327565Z 2025-05-07T20:32:43.2327661Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2327877Z self=, 2025-05-07T20:32:43.2327956Z T=1, 2025-05-07T20:32:43.2328027Z D=7168, 2025-05-07T20:32:43.2328101Z scale_ub=None, 2025-05-07T20:32:43.2328262Z contiguous=True, 2025-05-07T20:32:43.2328343Z compiled=False, 2025-05-07T20:32:43.2328412Z ) 2025-05-07T20:32:43.2328629Z self = 2025-05-07T20:32:43.2328786Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.2328791Z 2025-05-07T20:32:43.2328865Z @given( 2025-05-07T20:32:43.2328981Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2329076Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2329191Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2329301Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2329409Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2329480Z ) 2025-05-07T20:32:43.2329722Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2329809Z def test_silu_mul_quant( 2025-05-07T20:32:43.2329890Z self, 2025-05-07T20:32:43.2329963Z T: int, 2025-05-07T20:32:43.2330036Z D: int, 2025-05-07T20:32:43.2330132Z scale_ub: Optional[float], 2025-05-07T20:32:43.2330216Z contiguous: bool, 2025-05-07T20:32:43.2330298Z compiled: bool, 2025-05-07T20:32:43.2330371Z ) -> None: 2025-05-07T20:32:43.2330457Z torch.manual_seed(2025) 2025-05-07T20:32:43.2330527Z 2025-05-07T20:32:43.2330689Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2330756Z 2025-05-07T20:32:43.2330845Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2330962Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2331043Z x = x_sign * x_clamp 2025-05-07T20:32:43.2331119Z x0 = x[:, :D] 2025-05-07T20:32:43.2331279Z x1 = x[:, D:] 2025-05-07T20:32:43.2331346Z 2025-05-07T20:32:43.2331428Z if contiguous: 2025-05-07T20:32:43.2331512Z x0 = x0.contiguous() 2025-05-07T20:32:43.2331600Z x1 = x1.contiguous() 2025-05-07T20:32:43.2331672Z 2025-05-07T20:32:43.2331755Z if scale_ub is not None: 2025-05-07T20:32:43.2331857Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2331983Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2332054Z ) 2025-05-07T20:32:43.2332129Z else: 2025-05-07T20:32:43.2332217Z scale_ub_tensor = None 2025-05-07T20:32:43.2332281Z 2025-05-07T20:32:43.2332409Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2332494Z op = silu_mul_quant 2025-05-07T20:32:43.2332573Z if compiled: 2025-05-07T20:32:43.2332669Z op = torch.compile(op) 2025-05-07T20:32:43.2332774Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2332848Z 2025-05-07T20:32:43.2332936Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2332940Z 2025-05-07T20:32:43.2333032Z moe/activation_test.py:117: 2025-05-07T20:32:43.2333170Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2333264Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2333358Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2333849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2333942Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2334293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2334515Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2334848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2334948Z kernel = self.compile( 2025-05-07T20:32:43.2335424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2335594Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2335721Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2335725Z 2025-05-07T20:32:43.2335920Z self = 2025-05-07T20:32:43.2336682Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2337171Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4cab2e0>} 2025-05-07T20:32:43.2337913Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2338102Z context = 2025-05-07T20:32:43.2338107Z 2025-05-07T20:32:43.2338266Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2338528Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2338635Z module_map=module_map) 2025-05-07T20:32:43.2338792Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2338890Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2338966Z E ^ 2025-05-07T20:32:43.2339317Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2340045Z 2025-05-07T20:32:43.2342149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2342167Z 2025-05-07T20:32:43.2342277Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2342504Z self=, 2025-05-07T20:32:43.2342583Z T=16384, 2025-05-07T20:32:43.2342656Z D=7168, 2025-05-07T20:32:43.2342741Z scale_ub=1200.0, 2025-05-07T20:32:43.2342822Z contiguous=False, 2025-05-07T20:32:43.2342904Z compiled=True, 2025-05-07T20:32:43.2342970Z ) 2025-05-07T20:32:43.2343185Z self = 2025-05-07T20:32:43.2343366Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:43.2343371Z 2025-05-07T20:32:43.2343443Z @given( 2025-05-07T20:32:43.2343558Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2343661Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2343771Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2343884Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2343998Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2344067Z ) 2025-05-07T20:32:43.2344310Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2344398Z def test_silu_mul_quant( 2025-05-07T20:32:43.2344471Z self, 2025-05-07T20:32:43.2344549Z T: int, 2025-05-07T20:32:43.2344621Z D: int, 2025-05-07T20:32:43.2344712Z scale_ub: Optional[float], 2025-05-07T20:32:43.2344798Z contiguous: bool, 2025-05-07T20:32:43.2344876Z compiled: bool, 2025-05-07T20:32:43.2344951Z ) -> None: 2025-05-07T20:32:43.2345040Z torch.manual_seed(2025) 2025-05-07T20:32:43.2345108Z 2025-05-07T20:32:43.2345272Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2345354Z 2025-05-07T20:32:43.2345442Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2345833Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2345923Z x = x_sign * x_clamp 2025-05-07T20:32:43.2345996Z x0 = x[:, :D] 2025-05-07T20:32:43.2346071Z x1 = x[:, D:] 2025-05-07T20:32:43.2346135Z 2025-05-07T20:32:43.2346212Z if contiguous: 2025-05-07T20:32:43.2346301Z x0 = x0.contiguous() 2025-05-07T20:32:43.2346386Z x1 = x1.contiguous() 2025-05-07T20:32:43.2346453Z 2025-05-07T20:32:43.2346542Z if scale_ub is not None: 2025-05-07T20:32:43.2346642Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2346771Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2346849Z ) 2025-05-07T20:32:43.2346923Z else: 2025-05-07T20:32:43.2347018Z scale_ub_tensor = None 2025-05-07T20:32:43.2347090Z 2025-05-07T20:32:43.2347215Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2347303Z op = silu_mul_quant 2025-05-07T20:32:43.2347390Z if compiled: 2025-05-07T20:32:43.2347583Z op = torch.compile(op) 2025-05-07T20:32:43.2347689Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2347754Z 2025-05-07T20:32:43.2347840Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2347844Z 2025-05-07T20:32:43.2347939Z moe/activation_test.py:117: 2025-05-07T20:32:43.2348066Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2348163Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2348261Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2348625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2348716Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.2349212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2349440Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2349803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2350021Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2350360Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2350448Z kernel = self.compile( 2025-05-07T20:32:43.2350829Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2351000Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2351121Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2351126Z 2025-05-07T20:32:43.2351328Z self = 2025-05-07T20:32:43.2352107Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2352602Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4bb85e0>} 2025-05-07T20:32:43.2353495Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2353684Z context = 2025-05-07T20:32:43.2353688Z 2025-05-07T20:32:43.2353851Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2354114Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2354312Z module_map=module_map) 2025-05-07T20:32:43.2354477Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2354571Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2354643Z E ^ 2025-05-07T20:32:43.2354993Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2354998Z 2025-05-07T20:32:43.2355409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2355414Z 2025-05-07T20:32:43.2355514Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2355729Z self=, 2025-05-07T20:32:43.2355799Z T=1, 2025-05-07T20:32:43.2355874Z D=7168, 2025-05-07T20:32:43.2355953Z scale_ub=None, 2025-05-07T20:32:43.2356036Z contiguous=False, 2025-05-07T20:32:43.2356120Z compiled=False, 2025-05-07T20:32:43.2356186Z ) 2025-05-07T20:32:43.2356405Z self = 2025-05-07T20:32:43.2356567Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.2356572Z 2025-05-07T20:32:43.2356643Z @given( 2025-05-07T20:32:43.2356760Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2356852Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2356962Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2357077Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2357182Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2357253Z ) 2025-05-07T20:32:43.2357495Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2357664Z def test_silu_mul_quant( 2025-05-07T20:32:43.2357742Z self, 2025-05-07T20:32:43.2357820Z T: int, 2025-05-07T20:32:43.2357890Z D: int, 2025-05-07T20:32:43.2357989Z scale_ub: Optional[float], 2025-05-07T20:32:43.2358076Z contiguous: bool, 2025-05-07T20:32:43.2358153Z compiled: bool, 2025-05-07T20:32:43.2358227Z ) -> None: 2025-05-07T20:32:43.2358315Z torch.manual_seed(2025) 2025-05-07T20:32:43.2358381Z 2025-05-07T20:32:43.2358546Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2358615Z 2025-05-07T20:32:43.2358698Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2358819Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2358901Z x = x_sign * x_clamp 2025-05-07T20:32:43.2358981Z x0 = x[:, :D] 2025-05-07T20:32:43.2359055Z x1 = x[:, D:] 2025-05-07T20:32:43.2359122Z 2025-05-07T20:32:43.2359200Z if contiguous: 2025-05-07T20:32:43.2359292Z x0 = x0.contiguous() 2025-05-07T20:32:43.2359374Z x1 = x1.contiguous() 2025-05-07T20:32:43.2359446Z 2025-05-07T20:32:43.2359537Z if scale_ub is not None: 2025-05-07T20:32:43.2359637Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2359769Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2359840Z ) 2025-05-07T20:32:43.2359913Z else: 2025-05-07T20:32:43.2360008Z scale_ub_tensor = None 2025-05-07T20:32:43.2360075Z 2025-05-07T20:32:43.2360199Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2360293Z op = silu_mul_quant 2025-05-07T20:32:43.2360371Z if compiled: 2025-05-07T20:32:43.2360469Z op = torch.compile(op) 2025-05-07T20:32:43.2360569Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2360638Z 2025-05-07T20:32:43.2360726Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2360730Z 2025-05-07T20:32:43.2360825Z moe/activation_test.py:117: 2025-05-07T20:32:43.2360953Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2361131Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2361227Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2361717Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2361805Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2362159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2362378Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2362714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2362803Z kernel = self.compile( 2025-05-07T20:32:43.2363186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2363364Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2363494Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2363499Z 2025-05-07T20:32:43.2363760Z self = 2025-05-07T20:32:43.2364568Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2365063Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4bb8fe0>} 2025-05-07T20:32:43.2365798Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2366076Z context = 2025-05-07T20:32:43.2366081Z 2025-05-07T20:32:43.2366237Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2366494Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2366594Z module_map=module_map) 2025-05-07T20:32:43.2366748Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2366843Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2366915Z E ^ 2025-05-07T20:32:43.2367288Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2367293Z 2025-05-07T20:32:43.2367749Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2367759Z 2025-05-07T20:32:43.2367853Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2368073Z self=, 2025-05-07T20:32:43.2368142Z T=2048, 2025-05-07T20:32:43.2368215Z D=7168, 2025-05-07T20:32:43.2368295Z scale_ub=None, 2025-05-07T20:32:43.2368378Z contiguous=False, 2025-05-07T20:32:43.2368452Z compiled=True, 2025-05-07T20:32:43.2368524Z ) 2025-05-07T20:32:43.2368734Z self = 2025-05-07T20:32:43.2368903Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:43.2368908Z 2025-05-07T20:32:43.2368982Z @given( 2025-05-07T20:32:43.2369092Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2369186Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2369293Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2369409Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2369524Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2369675Z ) 2025-05-07T20:32:43.2369914Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2370003Z def test_silu_mul_quant( 2025-05-07T20:32:43.2370074Z self, 2025-05-07T20:32:43.2370148Z T: int, 2025-05-07T20:32:43.2370223Z D: int, 2025-05-07T20:32:43.2370313Z scale_ub: Optional[float], 2025-05-07T20:32:43.2370399Z contiguous: bool, 2025-05-07T20:32:43.2370476Z compiled: bool, 2025-05-07T20:32:43.2370544Z ) -> None: 2025-05-07T20:32:43.2370633Z torch.manual_seed(2025) 2025-05-07T20:32:43.2370698Z 2025-05-07T20:32:43.2370857Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2370928Z 2025-05-07T20:32:43.2371014Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2371134Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2371218Z x = x_sign * x_clamp 2025-05-07T20:32:43.2371291Z x0 = x[:, :D] 2025-05-07T20:32:43.2371369Z x1 = x[:, D:] 2025-05-07T20:32:43.2371438Z 2025-05-07T20:32:43.2371514Z if contiguous: 2025-05-07T20:32:43.2371604Z x0 = x0.contiguous() 2025-05-07T20:32:43.2371687Z x1 = x1.contiguous() 2025-05-07T20:32:43.2371755Z 2025-05-07T20:32:43.2371839Z if scale_ub is not None: 2025-05-07T20:32:43.2371940Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2372068Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2372141Z ) 2025-05-07T20:32:43.2372212Z else: 2025-05-07T20:32:43.2372298Z scale_ub_tensor = None 2025-05-07T20:32:43.2372372Z 2025-05-07T20:32:43.2372494Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2372577Z op = silu_mul_quant 2025-05-07T20:32:43.2372772Z if compiled: 2025-05-07T20:32:43.2372864Z op = torch.compile(op) 2025-05-07T20:32:43.2372968Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2373038Z 2025-05-07T20:32:43.2373119Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2373124Z 2025-05-07T20:32:43.2373218Z moe/activation_test.py:117: 2025-05-07T20:32:43.2373341Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2373436Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2373533Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2373894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2373979Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.2374553Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2374699Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2375236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2375503Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2375839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2375930Z kernel = self.compile( 2025-05-07T20:32:43.2376308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2376474Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2376596Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2376601Z 2025-05-07T20:32:43.2376795Z self = 2025-05-07T20:32:43.2377708Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2378206Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4bba7a0>} 2025-05-07T20:32:43.2378944Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2379128Z context = 2025-05-07T20:32:43.2379132Z 2025-05-07T20:32:43.2379288Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2379544Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2379651Z module_map=module_map) 2025-05-07T20:32:43.2379808Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2379903Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2379976Z E ^ 2025-05-07T20:32:43.2380328Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2380333Z 2025-05-07T20:32:43.2380741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2380746Z 2025-05-07T20:32:43.2380853Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2381070Z self=, 2025-05-07T20:32:43.2381142Z T=4096, 2025-05-07T20:32:43.2381215Z D=7168, 2025-05-07T20:32:43.2381291Z scale_ub=None, 2025-05-07T20:32:43.2381375Z contiguous=False, 2025-05-07T20:32:43.2381534Z compiled=True, 2025-05-07T20:32:43.2381602Z ) 2025-05-07T20:32:43.2381813Z self = 2025-05-07T20:32:43.2381987Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:43.2381992Z 2025-05-07T20:32:43.2382062Z @given( 2025-05-07T20:32:43.2382172Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2382266Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2382373Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2382488Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2382593Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2382661Z ) 2025-05-07T20:32:43.2382902Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2382994Z def test_silu_mul_quant( 2025-05-07T20:32:43.2383066Z self, 2025-05-07T20:32:43.2383137Z T: int, 2025-05-07T20:32:43.2383212Z D: int, 2025-05-07T20:32:43.2383301Z scale_ub: Optional[float], 2025-05-07T20:32:43.2383395Z contiguous: bool, 2025-05-07T20:32:43.2383478Z compiled: bool, 2025-05-07T20:32:43.2383556Z ) -> None: 2025-05-07T20:32:43.2383645Z torch.manual_seed(2025) 2025-05-07T20:32:43.2383712Z 2025-05-07T20:32:43.2383881Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2383953Z 2025-05-07T20:32:43.2384041Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2384163Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2384244Z x = x_sign * x_clamp 2025-05-07T20:32:43.2384316Z x0 = x[:, :D] 2025-05-07T20:32:43.2384398Z x1 = x[:, D:] 2025-05-07T20:32:43.2384465Z 2025-05-07T20:32:43.2384541Z if contiguous: 2025-05-07T20:32:43.2384631Z x0 = x0.contiguous() 2025-05-07T20:32:43.2384713Z x1 = x1.contiguous() 2025-05-07T20:32:43.2384785Z 2025-05-07T20:32:43.2384872Z if scale_ub is not None: 2025-05-07T20:32:43.2384972Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2385252Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2385359Z ) 2025-05-07T20:32:43.2385433Z else: 2025-05-07T20:32:43.2385526Z scale_ub_tensor = None 2025-05-07T20:32:43.2385593Z 2025-05-07T20:32:43.2385718Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2385806Z op = silu_mul_quant 2025-05-07T20:32:43.2385883Z if compiled: 2025-05-07T20:32:43.2385972Z op = torch.compile(op) 2025-05-07T20:32:43.2386074Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2386139Z 2025-05-07T20:32:43.2386223Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2386230Z 2025-05-07T20:32:43.2386321Z moe/activation_test.py:117: 2025-05-07T20:32:43.2386446Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2386549Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2386643Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2387012Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2387104Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.2387670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2387759Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2388113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2388328Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2388672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2388758Z kernel = self.compile( 2025-05-07T20:32:43.2389226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2389400Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2389519Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2389524Z 2025-05-07T20:32:43.2389724Z self = 2025-05-07T20:32:43.2390489Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2390983Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4bbb4c0>} 2025-05-07T20:32:43.2391724Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2391914Z context = 2025-05-07T20:32:43.2391918Z 2025-05-07T20:32:43.2392078Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2392330Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2392430Z module_map=module_map) 2025-05-07T20:32:43.2392588Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2392679Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2392755Z E ^ 2025-05-07T20:32:43.2393101Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2393106Z 2025-05-07T20:32:43.2393519Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2393524Z 2025-05-07T20:32:43.2393705Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2393924Z self=, 2025-05-07T20:32:43.2394004Z T=16384, 2025-05-07T20:32:43.2394078Z D=5120, 2025-05-07T20:32:43.2394151Z scale_ub=1200.0, 2025-05-07T20:32:43.2394236Z contiguous=False, 2025-05-07T20:32:43.2394315Z compiled=False, 2025-05-07T20:32:43.2394383Z ) 2025-05-07T20:32:43.2394603Z self = 2025-05-07T20:32:43.2394778Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:43.2394783Z 2025-05-07T20:32:43.2394855Z @given( 2025-05-07T20:32:43.2394969Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2395062Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2395174Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2395287Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2395400Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2395476Z ) 2025-05-07T20:32:43.2395714Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2395830Z def test_silu_mul_quant( 2025-05-07T20:32:43.2395936Z self, 2025-05-07T20:32:43.2396040Z T: int, 2025-05-07T20:32:43.2396127Z D: int, 2025-05-07T20:32:43.2396225Z scale_ub: Optional[float], 2025-05-07T20:32:43.2396307Z contiguous: bool, 2025-05-07T20:32:43.2396384Z compiled: bool, 2025-05-07T20:32:43.2396459Z ) -> None: 2025-05-07T20:32:43.2396546Z torch.manual_seed(2025) 2025-05-07T20:32:43.2396613Z 2025-05-07T20:32:43.2396779Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2396942Z 2025-05-07T20:32:43.2397035Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2397156Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2397262Z x = x_sign * x_clamp 2025-05-07T20:32:43.2397342Z x0 = x[:, :D] 2025-05-07T20:32:43.2397431Z x1 = x[:, D:] 2025-05-07T20:32:43.2397507Z 2025-05-07T20:32:43.2397584Z if contiguous: 2025-05-07T20:32:43.2397668Z x0 = x0.contiguous() 2025-05-07T20:32:43.2397749Z x1 = x1.contiguous() 2025-05-07T20:32:43.2397823Z 2025-05-07T20:32:43.2397907Z if scale_ub is not None: 2025-05-07T20:32:43.2398007Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2398137Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2398207Z ) 2025-05-07T20:32:43.2398280Z else: 2025-05-07T20:32:43.2398370Z scale_ub_tensor = None 2025-05-07T20:32:43.2398442Z 2025-05-07T20:32:43.2398573Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2398661Z op = silu_mul_quant 2025-05-07T20:32:43.2398739Z if compiled: 2025-05-07T20:32:43.2398842Z op = torch.compile(op) 2025-05-07T20:32:43.2398944Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2399013Z 2025-05-07T20:32:43.2399101Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2399105Z 2025-05-07T20:32:43.2399197Z moe/activation_test.py:117: 2025-05-07T20:32:43.2399325Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2399418Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2399519Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2400011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:43.2400102Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2400457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2400682Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2401128Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2401219Z kernel = self.compile( 2025-05-07T20:32:43.2401597Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2401765Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2401887Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2401891Z 2025-05-07T20:32:43.2402086Z self = 2025-05-07T20:32:43.2402853Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2403355Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4f1c860>} 2025-05-07T20:32:43.2404093Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2404278Z context = 2025-05-07T20:32:43.2404282Z 2025-05-07T20:32:43.2404439Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2404697Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2404800Z module_map=module_map) 2025-05-07T20:32:43.2404953Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2405128Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2405199Z E ^ 2025-05-07T20:32:43.2405551Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2405561Z 2025-05-07T20:32:43.2405971Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2405976Z 2025-05-07T20:32:43.2406071Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2406292Z self=, 2025-05-07T20:32:43.2406364Z T=16384, 2025-05-07T20:32:43.2406438Z D=5120, 2025-05-07T20:32:43.2406539Z scale_ub=1200.0, 2025-05-07T20:32:43.2406655Z contiguous=True, 2025-05-07T20:32:43.2406769Z compiled=True, 2025-05-07T20:32:43.2406860Z ) 2025-05-07T20:32:43.2407075Z self = 2025-05-07T20:32:43.2407256Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.2407261Z 2025-05-07T20:32:43.2407339Z @given( 2025-05-07T20:32:43.2407451Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2407545Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2407653Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2407767Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2407887Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2407957Z ) 2025-05-07T20:32:43.2408192Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2408286Z def test_silu_mul_quant( 2025-05-07T20:32:43.2408358Z self, 2025-05-07T20:32:43.2408435Z T: int, 2025-05-07T20:32:43.2408507Z D: int, 2025-05-07T20:32:43.2408596Z scale_ub: Optional[float], 2025-05-07T20:32:43.2408684Z contiguous: bool, 2025-05-07T20:32:43.2408762Z compiled: bool, 2025-05-07T20:32:43.2408835Z ) -> None: 2025-05-07T20:32:43.2408928Z torch.manual_seed(2025) 2025-05-07T20:32:43.2409085Z 2025-05-07T20:32:43.2409248Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2409320Z 2025-05-07T20:32:43.2409405Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2409523Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2409611Z x = x_sign * x_clamp 2025-05-07T20:32:43.2409684Z x0 = x[:, :D] 2025-05-07T20:32:43.2409760Z x1 = x[:, D:] 2025-05-07T20:32:43.2409828Z 2025-05-07T20:32:43.2409905Z if contiguous: 2025-05-07T20:32:43.2409991Z x0 = x0.contiguous() 2025-05-07T20:32:43.2410076Z x1 = x1.contiguous() 2025-05-07T20:32:43.2410145Z 2025-05-07T20:32:43.2410232Z if scale_ub is not None: 2025-05-07T20:32:43.2410332Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2410462Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2410537Z ) 2025-05-07T20:32:43.2410614Z else: 2025-05-07T20:32:43.2410702Z scale_ub_tensor = None 2025-05-07T20:32:43.2410785Z 2025-05-07T20:32:43.2410908Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2410993Z op = silu_mul_quant 2025-05-07T20:32:43.2411075Z if compiled: 2025-05-07T20:32:43.2411169Z op = torch.compile(op) 2025-05-07T20:32:43.2411274Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2411342Z 2025-05-07T20:32:43.2411426Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2411430Z 2025-05-07T20:32:43.2411524Z moe/activation_test.py:117: 2025-05-07T20:32:43.2411646Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2411739Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2411833Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2412284Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2412377Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.2412864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2412953Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2413307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2413521Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2413857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2413946Z kernel = self.compile( 2025-05-07T20:32:43.2414325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2414504Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2414628Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2414633Z 2025-05-07T20:32:43.2414830Z self = 2025-05-07T20:32:43.2415597Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2416089Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4f1db20>} 2025-05-07T20:32:43.2416826Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2417012Z context = 2025-05-07T20:32:43.2417096Z 2025-05-07T20:32:43.2417276Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2417648Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2417792Z module_map=module_map) 2025-05-07T20:32:43.2418015Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2418146Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2418248Z E ^ 2025-05-07T20:32:43.2418653Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2418658Z 2025-05-07T20:32:43.2419067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2419078Z 2025-05-07T20:32:43.2419177Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2419398Z self=, 2025-05-07T20:32:43.2419472Z T=16384, 2025-05-07T20:32:43.2419550Z D=5120, 2025-05-07T20:32:43.2419627Z scale_ub=None, 2025-05-07T20:32:43.2419711Z contiguous=False, 2025-05-07T20:32:43.2419793Z compiled=True, 2025-05-07T20:32:43.2419864Z ) 2025-05-07T20:32:43.2420076Z self = 2025-05-07T20:32:43.2420251Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:43.2420256Z 2025-05-07T20:32:43.2420328Z @given( 2025-05-07T20:32:43.2420451Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2420549Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2420658Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2420776Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2420986Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2421058Z ) 2025-05-07T20:32:43.2421304Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2421393Z def test_silu_mul_quant( 2025-05-07T20:32:43.2421465Z self, 2025-05-07T20:32:43.2421540Z T: int, 2025-05-07T20:32:43.2421612Z D: int, 2025-05-07T20:32:43.2421709Z scale_ub: Optional[float], 2025-05-07T20:32:43.2421793Z contiguous: bool, 2025-05-07T20:32:43.2421874Z compiled: bool, 2025-05-07T20:32:43.2421950Z ) -> None: 2025-05-07T20:32:43.2422041Z torch.manual_seed(2025) 2025-05-07T20:32:43.2422110Z 2025-05-07T20:32:43.2422274Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2422344Z 2025-05-07T20:32:43.2422430Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2426500Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2426617Z x = x_sign * x_clamp 2025-05-07T20:32:43.2426699Z x0 = x[:, :D] 2025-05-07T20:32:43.2426779Z x1 = x[:, D:] 2025-05-07T20:32:43.2426851Z 2025-05-07T20:32:43.2426933Z if contiguous: 2025-05-07T20:32:43.2427026Z x0 = x0.contiguous() 2025-05-07T20:32:43.2427111Z x1 = x1.contiguous() 2025-05-07T20:32:43.2427183Z 2025-05-07T20:32:43.2427266Z if scale_ub is not None: 2025-05-07T20:32:43.2427368Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2427581Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2427655Z ) 2025-05-07T20:32:43.2427729Z else: 2025-05-07T20:32:43.2427826Z scale_ub_tensor = None 2025-05-07T20:32:43.2427896Z 2025-05-07T20:32:43.2428061Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2428192Z op = silu_mul_quant 2025-05-07T20:32:43.2428308Z if compiled: 2025-05-07T20:32:43.2428431Z op = torch.compile(op) 2025-05-07T20:32:43.2428540Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2428606Z 2025-05-07T20:32:43.2428811Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2428817Z 2025-05-07T20:32:43.2428912Z moe/activation_test.py:117: 2025-05-07T20:32:43.2429040Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2429141Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2429235Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2429613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2429703Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.2430189Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2430283Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2430640Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2430863Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2431202Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2431292Z kernel = self.compile( 2025-05-07T20:32:43.2431675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2431850Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2431972Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2431977Z 2025-05-07T20:32:43.2432177Z self = 2025-05-07T20:32:43.2432944Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2433568Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4f1e8e0>} 2025-05-07T20:32:43.2434303Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2434489Z context = 2025-05-07T20:32:43.2434493Z 2025-05-07T20:32:43.2434655Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2434913Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2435022Z module_map=module_map) 2025-05-07T20:32:43.2435186Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2435281Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2435360Z E ^ 2025-05-07T20:32:43.2435710Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2435716Z 2025-05-07T20:32:43.2436120Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2436129Z 2025-05-07T20:32:43.2436225Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2436442Z self=, 2025-05-07T20:32:43.2436517Z T=2048, 2025-05-07T20:32:43.2436590Z D=5120, 2025-05-07T20:32:43.2436679Z scale_ub=None, 2025-05-07T20:32:43.2436766Z contiguous=False, 2025-05-07T20:32:43.2436844Z compiled=True, 2025-05-07T20:32:43.2436910Z ) 2025-05-07T20:32:43.2437133Z self = 2025-05-07T20:32:43.2437407Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:43.2437412Z 2025-05-07T20:32:43.2437490Z @given( 2025-05-07T20:32:43.2437611Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2437708Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2437834Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2437947Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2438055Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2438132Z ) 2025-05-07T20:32:43.2438374Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2438461Z def test_silu_mul_quant( 2025-05-07T20:32:43.2438537Z self, 2025-05-07T20:32:43.2438610Z T: int, 2025-05-07T20:32:43.2438684Z D: int, 2025-05-07T20:32:43.2438807Z scale_ub: Optional[float], 2025-05-07T20:32:43.2438933Z contiguous: bool, 2025-05-07T20:32:43.2439055Z compiled: bool, 2025-05-07T20:32:43.2439143Z ) -> None: 2025-05-07T20:32:43.2439234Z torch.manual_seed(2025) 2025-05-07T20:32:43.2439306Z 2025-05-07T20:32:43.2439471Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2439545Z 2025-05-07T20:32:43.2439638Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2439758Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2439840Z x = x_sign * x_clamp 2025-05-07T20:32:43.2439926Z x0 = x[:, :D] 2025-05-07T20:32:43.2440001Z x1 = x[:, D:] 2025-05-07T20:32:43.2440489Z 2025-05-07T20:32:43.2440623Z if contiguous: 2025-05-07T20:32:43.2440751Z x0 = x0.contiguous() 2025-05-07T20:32:43.2440871Z x1 = x1.contiguous() 2025-05-07T20:32:43.2440976Z 2025-05-07T20:32:43.2441112Z if scale_ub is not None: 2025-05-07T20:32:43.2441411Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2441545Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2441623Z ) 2025-05-07T20:32:43.2441701Z else: 2025-05-07T20:32:43.2441792Z scale_ub_tensor = None 2025-05-07T20:32:43.2441862Z 2025-05-07T20:32:43.2441993Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2442080Z op = silu_mul_quant 2025-05-07T20:32:43.2442164Z if compiled: 2025-05-07T20:32:43.2442266Z op = torch.compile(op) 2025-05-07T20:32:43.2442369Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2442437Z 2025-05-07T20:32:43.2442525Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2442530Z 2025-05-07T20:32:43.2442625Z moe/activation_test.py:117: 2025-05-07T20:32:43.2442760Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2442860Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2442955Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2443329Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2443419Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.2443909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2444007Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2444364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2444586Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2444922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2445010Z kernel = self.compile( 2025-05-07T20:32:43.2445414Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2445708Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2445836Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2445842Z 2025-05-07T20:32:43.2446036Z self = 2025-05-07T20:32:43.2446810Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2447312Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c48c4040>} 2025-05-07T20:32:43.2448052Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2448254Z context = 2025-05-07T20:32:43.2448259Z 2025-05-07T20:32:43.2448417Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2448673Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2448781Z module_map=module_map) 2025-05-07T20:32:43.2448938Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2449036Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2449109Z E ^ 2025-05-07T20:32:43.2449456Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2449461Z 2025-05-07T20:32:43.2450032Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2450134Z 2025-05-07T20:32:43.2450237Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2450497Z self=, 2025-05-07T20:32:43.2450601Z T=2048, 2025-05-07T20:32:43.2450685Z D=5120, 2025-05-07T20:32:43.2450767Z scale_ub=1200.0, 2025-05-07T20:32:43.2450848Z contiguous=False, 2025-05-07T20:32:43.2450933Z compiled=True, 2025-05-07T20:32:43.2451001Z ) 2025-05-07T20:32:43.2451215Z self = 2025-05-07T20:32:43.2451384Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:43.2451389Z 2025-05-07T20:32:43.2451466Z @given( 2025-05-07T20:32:43.2451583Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2451680Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2451791Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2451908Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2452020Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2452094Z ) 2025-05-07T20:32:43.2452332Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2452424Z def test_silu_mul_quant( 2025-05-07T20:32:43.2452496Z self, 2025-05-07T20:32:43.2452570Z T: int, 2025-05-07T20:32:43.2452645Z D: int, 2025-05-07T20:32:43.2452737Z scale_ub: Optional[float], 2025-05-07T20:32:43.2452823Z contiguous: bool, 2025-05-07T20:32:43.2452907Z compiled: bool, 2025-05-07T20:32:43.2452981Z ) -> None: 2025-05-07T20:32:43.2453075Z torch.manual_seed(2025) 2025-05-07T20:32:43.2453143Z 2025-05-07T20:32:43.2453306Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2453376Z 2025-05-07T20:32:43.2453464Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2453587Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2453676Z x = x_sign * x_clamp 2025-05-07T20:32:43.2453836Z x0 = x[:, :D] 2025-05-07T20:32:43.2453914Z x1 = x[:, D:] 2025-05-07T20:32:43.2453986Z 2025-05-07T20:32:43.2454063Z if contiguous: 2025-05-07T20:32:43.2454147Z x0 = x0.contiguous() 2025-05-07T20:32:43.2454233Z x1 = x1.contiguous() 2025-05-07T20:32:43.2454302Z 2025-05-07T20:32:43.2454386Z if scale_ub is not None: 2025-05-07T20:32:43.2454490Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2454617Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2454691Z ) 2025-05-07T20:32:43.2454763Z else: 2025-05-07T20:32:43.2454852Z scale_ub_tensor = None 2025-05-07T20:32:43.2454921Z 2025-05-07T20:32:43.2455047Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2455139Z op = silu_mul_quant 2025-05-07T20:32:43.2455224Z if compiled: 2025-05-07T20:32:43.2455317Z op = torch.compile(op) 2025-05-07T20:32:43.2455422Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2455496Z 2025-05-07T20:32:43.2455582Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2455587Z 2025-05-07T20:32:43.2455680Z moe/activation_test.py:117: 2025-05-07T20:32:43.2455803Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2455897Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2455993Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2456355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2456443Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.2456932Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2457108Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2457525Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2457743Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2458079Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2458173Z kernel = self.compile( 2025-05-07T20:32:43.2458573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2458747Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2458875Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2458879Z 2025-05-07T20:32:43.2459075Z self = 2025-05-07T20:32:43.2459848Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2460344Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c48c4e00>} 2025-05-07T20:32:43.2461082Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2461268Z context = 2025-05-07T20:32:43.2461273Z 2025-05-07T20:32:43.2461434Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2461693Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2461800Z module_map=module_map) 2025-05-07T20:32:43.2461956Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2462137Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2462211Z E ^ 2025-05-07T20:32:43.2462562Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2462568Z 2025-05-07T20:32:43.2462979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2462984Z 2025-05-07T20:32:43.2463082Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2463299Z self=, 2025-05-07T20:32:43.2463374Z T=4096, 2025-05-07T20:32:43.2463450Z D=5120, 2025-05-07T20:32:43.2463530Z scale_ub=1200.0, 2025-05-07T20:32:43.2463609Z contiguous=True, 2025-05-07T20:32:43.2463693Z compiled=True, 2025-05-07T20:32:43.2463762Z ) 2025-05-07T20:32:43.2463975Z self = 2025-05-07T20:32:43.2464150Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.2464155Z 2025-05-07T20:32:43.2464226Z @given( 2025-05-07T20:32:43.2464340Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2464439Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2464549Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2464668Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2464774Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2464847Z ) 2025-05-07T20:32:43.2465088Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2465177Z def test_silu_mul_quant( 2025-05-07T20:32:43.2465248Z self, 2025-05-07T20:32:43.2465323Z T: int, 2025-05-07T20:32:43.2465498Z D: int, 2025-05-07T20:32:43.2465591Z scale_ub: Optional[float], 2025-05-07T20:32:43.2465680Z contiguous: bool, 2025-05-07T20:32:43.2465763Z compiled: bool, 2025-05-07T20:32:43.2465834Z ) -> None: 2025-05-07T20:32:43.2465928Z torch.manual_seed(2025) 2025-05-07T20:32:43.2465996Z 2025-05-07T20:32:43.2466162Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2466232Z 2025-05-07T20:32:43.2466317Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2466442Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2466527Z x = x_sign * x_clamp 2025-05-07T20:32:43.2466601Z x0 = x[:, :D] 2025-05-07T20:32:43.2466679Z x1 = x[:, D:] 2025-05-07T20:32:43.2466745Z 2025-05-07T20:32:43.2466822Z if contiguous: 2025-05-07T20:32:43.2466911Z x0 = x0.contiguous() 2025-05-07T20:32:43.2466995Z x1 = x1.contiguous() 2025-05-07T20:32:43.2467073Z 2025-05-07T20:32:43.2467167Z if scale_ub is not None: 2025-05-07T20:32:43.2467291Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2467524Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2467611Z ) 2025-05-07T20:32:43.2467685Z else: 2025-05-07T20:32:43.2467777Z scale_ub_tensor = None 2025-05-07T20:32:43.2467845Z 2025-05-07T20:32:43.2467969Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2468056Z op = silu_mul_quant 2025-05-07T20:32:43.2468137Z if compiled: 2025-05-07T20:32:43.2468233Z op = torch.compile(op) 2025-05-07T20:32:43.2468338Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2468407Z 2025-05-07T20:32:43.2468491Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2468495Z 2025-05-07T20:32:43.2468592Z moe/activation_test.py:117: 2025-05-07T20:32:43.2468719Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2468822Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2468915Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2469408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2469504Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.2469986Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2470078Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2470435Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2470652Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2470985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2471077Z kernel = self.compile( 2025-05-07T20:32:43.2471460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2471634Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2471754Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2471759Z 2025-05-07T20:32:43.2471958Z self = 2025-05-07T20:32:43.2472722Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2473212Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c48c60c0>} 2025-05-07T20:32:43.2474110Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2474296Z context = 2025-05-07T20:32:43.2474302Z 2025-05-07T20:32:43.2474462Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2474717Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2474822Z module_map=module_map) 2025-05-07T20:32:43.2474982Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2475074Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2475146Z E ^ 2025-05-07T20:32:43.2475495Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2475500Z 2025-05-07T20:32:43.2475908Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2475912Z 2025-05-07T20:32:43.2476017Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2476232Z self=, 2025-05-07T20:32:43.2476305Z T=128, 2025-05-07T20:32:43.2476383Z D=5120, 2025-05-07T20:32:43.2476463Z scale_ub=1200.0, 2025-05-07T20:32:43.2476547Z contiguous=False, 2025-05-07T20:32:43.2476623Z compiled=True, 2025-05-07T20:32:43.2476692Z ) 2025-05-07T20:32:43.2476911Z self = 2025-05-07T20:32:43.2477075Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:43.2477079Z 2025-05-07T20:32:43.2477152Z @given( 2025-05-07T20:32:43.2477268Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2477363Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2477478Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2477594Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2477782Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2477860Z ) 2025-05-07T20:32:43.2478098Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2478184Z def test_silu_mul_quant( 2025-05-07T20:32:43.2478259Z self, 2025-05-07T20:32:43.2478333Z T: int, 2025-05-07T20:32:43.2478405Z D: int, 2025-05-07T20:32:43.2478498Z scale_ub: Optional[float], 2025-05-07T20:32:43.2478585Z contiguous: bool, 2025-05-07T20:32:43.2478664Z compiled: bool, 2025-05-07T20:32:43.2478738Z ) -> None: 2025-05-07T20:32:43.2478828Z torch.manual_seed(2025) 2025-05-07T20:32:43.2478897Z 2025-05-07T20:32:43.2479062Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2479137Z 2025-05-07T20:32:43.2479226Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2479343Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2479431Z x = x_sign * x_clamp 2025-05-07T20:32:43.2479509Z x0 = x[:, :D] 2025-05-07T20:32:43.2479583Z x1 = x[:, D:] 2025-05-07T20:32:43.2479648Z 2025-05-07T20:32:43.2479730Z if contiguous: 2025-05-07T20:32:43.2479814Z x0 = x0.contiguous() 2025-05-07T20:32:43.2479896Z x1 = x1.contiguous() 2025-05-07T20:32:43.2479969Z 2025-05-07T20:32:43.2480053Z if scale_ub is not None: 2025-05-07T20:32:43.2480151Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2480284Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2480358Z ) 2025-05-07T20:32:43.2480432Z else: 2025-05-07T20:32:43.2480522Z scale_ub_tensor = None 2025-05-07T20:32:43.2480591Z 2025-05-07T20:32:43.2480718Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2480886Z op = silu_mul_quant 2025-05-07T20:32:43.2480966Z if compiled: 2025-05-07T20:32:43.2481068Z op = torch.compile(op) 2025-05-07T20:32:43.2481168Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2481237Z 2025-05-07T20:32:43.2481326Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2481330Z 2025-05-07T20:32:43.2481423Z moe/activation_test.py:117: 2025-05-07T20:32:43.2481546Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2481644Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2481737Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2482103Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2482192Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.2482674Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2482775Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2483133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2483348Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2483682Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2483770Z kernel = self.compile( 2025-05-07T20:32:43.2484169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2484336Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2484456Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2484460Z 2025-05-07T20:32:43.2484657Z self = 2025-05-07T20:32:43.2485506Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2486000Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c48c72e0>} 2025-05-07T20:32:43.2486735Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2486922Z context = 2025-05-07T20:32:43.2486926Z 2025-05-07T20:32:43.2487083Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2487340Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2487449Z module_map=module_map) 2025-05-07T20:32:43.2487609Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2487703Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2487776Z E ^ 2025-05-07T20:32:43.2488121Z E ValueError("type fp8e4nv not supported in this architecture. 
The next six examples reach the same kernel launch and fail with the identical error, triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"), raised from triton/compiler/compiler.py:100; the test body and traceback repeat verbatim as above:

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)   -> CompilationError
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)  -> CompilationError
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)     -> CompilationError
Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)   -> CompilationError
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)    -> CompilationError
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)      -> CompilationError
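To replay one of these draws deterministically while debugging, rather than re-running the whole property-based sweep, the failing inputs can be pinned with Hypothesis's example decorator; explicit examples run before any generated ones. A sketch, assuming it wraps the same test shown above:

from hypothesis import Verbosity, example, given, settings, strategies as st

@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
    contiguous=st.sampled_from([True, False]),
    compiled=st.sampled_from([True, False]),
)
@example(T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)  # first failing draw in this log
@settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
def test_silu_mul_quant(self, T, D, scale_ub, contiguous, compiled) -> None:
    ...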
The sweep then starts hitting CUDA out-of-memory errors during input setup, before the kernel is ever reached:

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=False)

>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 144.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 40.52 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

The following examples fail the same way; only the failing statement and the requested size differ:

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)   -> OutOfMemoryError at moe/activation_test.py:95 (torch.clamp), tried to allocate 112.00 MiB
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)  -> OutOfMemoryError at moe/activation_test.py:92 (torch.randn), tried to allocate 448.00 MiB
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)   -> OutOfMemoryError at moe/activation_test.py:95 (torch.clamp), tried to allocate 56.00 MiB
Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)    -> OutOfMemoryError at moe/activation_test.py:94 (torch.sign), tried to allocate 56.00 MiB
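Note the shrinking request sizes: allocations as small as 56.00 MiB fail because memory accumulates across Hypothesis examples on the ~22 GiB device, not because any single tensor is oversized. Beyond the PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True hint in the message itself (which must be set before the first CUDA allocation to take effect), releasing the caching allocator between examples keeps peak usage bounded. A sketch of a helper that a teardown hook could call (the function name is illustrative):

import gc

import torch

def release_cuda_memory() -> None:
    # Drop dangling Python references first, then return cached blocks to
    # the driver so the next example starts from a near-empty allocator.
    gc.collect()
    torch.cuda.empty_cache()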
Between the out-of-memory failures, smaller examples still compile the kernel and hit the same fp8e4nv error, with the test body and traceback repeating verbatim as above:

Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)   -> CompilationError (type fp8e4nv not supported in this architecture)
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)   -> CompilationError (type fp8e4nv not supported in this architecture)
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)   -> CompilationError (type fp8e4nv not supported in this architecture)
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2631578Z 2025-05-07T20:32:43.2631991Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2631995Z 2025-05-07T20:32:43.2632093Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2632311Z self=, 2025-05-07T20:32:43.2632387Z T=2048, 2025-05-07T20:32:43.2632460Z D=7168, 2025-05-07T20:32:43.2632541Z scale_ub=1200.0, 2025-05-07T20:32:43.2632621Z contiguous=True, 2025-05-07T20:32:43.2632701Z compiled=False, 2025-05-07T20:32:43.2632770Z ) 2025-05-07T20:32:43.2632980Z self = 2025-05-07T20:32:43.2633146Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.2633154Z 2025-05-07T20:32:43.2633232Z @given( 2025-05-07T20:32:43.2633350Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2633446Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2633554Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2633665Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2633775Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2633843Z ) 2025-05-07T20:32:43.2634080Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2634275Z def test_silu_mul_quant( 2025-05-07T20:32:43.2634353Z self, 2025-05-07T20:32:43.2634432Z T: int, 2025-05-07T20:32:43.2634505Z D: int, 2025-05-07T20:32:43.2634596Z scale_ub: Optional[float], 2025-05-07T20:32:43.2634680Z contiguous: bool, 2025-05-07T20:32:43.2634759Z compiled: bool, 2025-05-07T20:32:43.2634834Z ) -> None: 2025-05-07T20:32:43.2634929Z torch.manual_seed(2025) 2025-05-07T20:32:43.2634997Z 2025-05-07T20:32:43.2635158Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2636931Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.70 GiB is allocated by PyTorch, and 53.93 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
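[NOTE] The recurring triton.compiler.errors.CompilationError above is an architecture limitation rather than a kernel bug: Triton's fp8e4nv type is FP8 E4M3, which Triton only lowers on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper). On older parts (for example an SM 8.6 card such as the A10G), only fp8e4b15 and fp8e5 are available, exactly as the ValueError lists. A minimal sketch of a capability guard that would skip these cases on unsupported GPUs (the helper name and the unittest.skipIf wiring are illustrative, not code from activation_test.py):

    import unittest

    import torch

    def supports_fp8_e4m3() -> bool:
        # Triton's fp8e4nv (FP8 E4M3) requires compute capability >= (8, 9).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not supports_fp8_e4m3(), "fp8e4nv not supported on this GPU architecture")
    class ActivationTest(unittest.TestCase):
        ...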
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2636945Z 2025-05-07T20:32:43.2637057Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2637062Z 2025-05-07T20:32:43.2637164Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2637381Z self=, 2025-05-07T20:32:43.2637450Z T=1, 2025-05-07T20:32:43.2637519Z D=5120, 2025-05-07T20:32:43.2637595Z scale_ub=1200.0, 2025-05-07T20:32:43.2637671Z contiguous=True, 2025-05-07T20:32:43.2637749Z compiled=False, 2025-05-07T20:32:43.2637814Z ) 2025-05-07T20:32:43.2638029Z self = 2025-05-07T20:32:43.2638189Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.2638198Z 2025-05-07T20:32:43.2638273Z @given( 2025-05-07T20:32:43.2638388Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2638586Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2638696Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2638809Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2638915Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2638984Z ) 2025-05-07T20:32:43.2639218Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2639306Z def test_silu_mul_quant( 2025-05-07T20:32:43.2639381Z self, 2025-05-07T20:32:43.2639452Z T: int, 2025-05-07T20:32:43.2639526Z D: int, 2025-05-07T20:32:43.2639622Z scale_ub: Optional[float], 2025-05-07T20:32:43.2639707Z contiguous: bool, 2025-05-07T20:32:43.2639784Z compiled: bool, 2025-05-07T20:32:43.2639860Z ) -> None: 2025-05-07T20:32:43.2639951Z torch.manual_seed(2025) 2025-05-07T20:32:43.2640020Z 2025-05-07T20:32:43.2640424Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2640497Z 2025-05-07T20:32:43.2640584Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2640708Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2640792Z x = x_sign * x_clamp 2025-05-07T20:32:43.2640869Z x0 = x[:, :D] 2025-05-07T20:32:43.2640942Z x1 = x[:, D:] 2025-05-07T20:32:43.2641011Z 2025-05-07T20:32:43.2641090Z if contiguous: 2025-05-07T20:32:43.2641178Z x0 = x0.contiguous() 2025-05-07T20:32:43.2641260Z x1 = x1.contiguous() 2025-05-07T20:32:43.2641334Z 2025-05-07T20:32:43.2641418Z if scale_ub is not None: 2025-05-07T20:32:43.2641518Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2641651Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2641862Z ) 2025-05-07T20:32:43.2641932Z else: 2025-05-07T20:32:43.2642024Z scale_ub_tensor = None 2025-05-07T20:32:43.2642094Z 2025-05-07T20:32:43.2642229Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2642315Z op = silu_mul_quant 2025-05-07T20:32:43.2642394Z if compiled: 2025-05-07T20:32:43.2642488Z op = torch.compile(op) 2025-05-07T20:32:43.2642587Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2642656Z 2025-05-07T20:32:43.2642748Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2642752Z 2025-05-07T20:32:43.2642841Z moe/activation_test.py:117: 2025-05-07T20:32:43.2642963Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2643060Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2643153Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2643641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2643739Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2644095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2644313Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2644649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2644738Z kernel = self.compile( 2025-05-07T20:32:43.2645133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2645300Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2645426Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2645431Z 2025-05-07T20:32:43.2645627Z self = 2025-05-07T20:32:43.2646506Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2647002Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c452bce0>} 2025-05-07T20:32:43.2647733Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2647921Z context = 2025-05-07T20:32:43.2647926Z 2025-05-07T20:32:43.2648086Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2648343Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2648447Z module_map=module_map) 2025-05-07T20:32:43.2648608Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2648704Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2648774Z E ^ 2025-05-07T20:32:43.2649120Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2649124Z 2025-05-07T20:32:43.2649536Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2649541Z 2025-05-07T20:32:43.2649635Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2649853Z self=, 2025-05-07T20:32:43.2649925Z T=2048, 2025-05-07T20:32:43.2649992Z D=5120, 2025-05-07T20:32:43.2650068Z scale_ub=None, 2025-05-07T20:32:43.2650227Z contiguous=True, 2025-05-07T20:32:43.2650309Z compiled=False, 2025-05-07T20:32:43.2650385Z ) 2025-05-07T20:32:43.2650600Z self = 2025-05-07T20:32:43.2650767Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.2650776Z 2025-05-07T20:32:43.2650847Z @given( 2025-05-07T20:32:43.2650959Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2651053Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2651162Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2651272Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2651383Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2651451Z ) 2025-05-07T20:32:43.2651691Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2651783Z def test_silu_mul_quant( 2025-05-07T20:32:43.2651860Z self, 2025-05-07T20:32:43.2651930Z T: int, 2025-05-07T20:32:43.2652002Z D: int, 2025-05-07T20:32:43.2652092Z scale_ub: Optional[float], 2025-05-07T20:32:43.2652181Z contiguous: bool, 2025-05-07T20:32:43.2652261Z compiled: bool, 2025-05-07T20:32:43.2652334Z ) -> None: 2025-05-07T20:32:43.2652425Z torch.manual_seed(2025) 2025-05-07T20:32:43.2652493Z 2025-05-07T20:32:43.2652654Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2652725Z 2025-05-07T20:32:43.2652810Z > x_sign = torch.sign(x) 2025-05-07T20:32:43.2654642Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2654653Z 2025-05-07T20:32:43.2654765Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:43.2654770Z 2025-05-07T20:32:43.2654866Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2655083Z self=, 2025-05-07T20:32:43.2655153Z T=16384, 2025-05-07T20:32:43.2655223Z D=5120, 2025-05-07T20:32:43.2655298Z scale_ub=None, 2025-05-07T20:32:43.2655375Z contiguous=True, 2025-05-07T20:32:43.2655453Z compiled=False, 2025-05-07T20:32:43.2655521Z ) 2025-05-07T20:32:43.2655734Z self = 2025-05-07T20:32:43.2655907Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.2655911Z 2025-05-07T20:32:43.2655986Z @given( 2025-05-07T20:32:43.2656096Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2656191Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2656304Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2656416Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2656522Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2656589Z ) 2025-05-07T20:32:43.2656828Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2656913Z def test_silu_mul_quant( 2025-05-07T20:32:43.2656984Z self, 2025-05-07T20:32:43.2657058Z T: int, 2025-05-07T20:32:43.2657130Z D: int, 2025-05-07T20:32:43.2657220Z scale_ub: Optional[float], 2025-05-07T20:32:43.2657309Z contiguous: bool, 2025-05-07T20:32:43.2657408Z compiled: bool, 2025-05-07T20:32:43.2657487Z ) -> None: 2025-05-07T20:32:43.2657601Z torch.manual_seed(2025) 2025-05-07T20:32:43.2657748Z 2025-05-07T20:32:43.2657910Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2659665Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
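[NOTE] The OutOfMemoryError run above and below looks like a cascade rather than independent failures: each Hypothesis example allocates a fresh [T, 2 * D] bfloat16 input on the same device, and once earlier examples have exhausted the ~22 GiB card, even the small 40-448 MiB requests fail. The error text itself names the allocator knob; a sketch of the two standard mitigations, assuming they run in the test process before CUDA is first touched (the free_cuda_memory helper is illustrative, not part of activation_test.py):

    import os

    # Must be set before the first CUDA allocation in the process.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import gc

    import torch

    def free_cuda_memory() -> None:
        # Drop dangling references, then release cached, unused blocks back to
        # the driver so the next Hypothesis example starts from a clean pool.
        gc.collect()
        torch.cuda.empty_cache()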
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2659671Z 2025-05-07T20:32:43.2659792Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2659796Z 2025-05-07T20:32:43.2659890Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2660104Z self=, 2025-05-07T20:32:43.2660183Z T=4096, 2025-05-07T20:32:43.2660253Z D=5120, 2025-05-07T20:32:43.2660330Z scale_ub=None, 2025-05-07T20:32:43.2660416Z contiguous=True, 2025-05-07T20:32:43.2660495Z compiled=False, 2025-05-07T20:32:43.2660565Z ) 2025-05-07T20:32:43.2660780Z self = 2025-05-07T20:32:43.2660941Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.2660946Z 2025-05-07T20:32:43.2661020Z @given( 2025-05-07T20:32:43.2661129Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2661219Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2661329Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2661437Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2661543Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2661616Z ) 2025-05-07T20:32:43.2661857Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2661944Z def test_silu_mul_quant( 2025-05-07T20:32:43.2662098Z self, 2025-05-07T20:32:43.2662168Z T: int, 2025-05-07T20:32:43.2662245Z D: int, 2025-05-07T20:32:43.2662340Z scale_ub: Optional[float], 2025-05-07T20:32:43.2662426Z contiguous: bool, 2025-05-07T20:32:43.2662509Z compiled: bool, 2025-05-07T20:32:43.2662582Z ) -> None: 2025-05-07T20:32:43.2662671Z torch.manual_seed(2025) 2025-05-07T20:32:43.2662740Z 2025-05-07T20:32:43.2662901Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2664650Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2664660Z 2025-05-07T20:32:43.2664771Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2664775Z 2025-05-07T20:32:43.2664873Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2665091Z self=, 2025-05-07T20:32:43.2665164Z T=2048, 2025-05-07T20:32:43.2665238Z D=5120, 2025-05-07T20:32:43.2665315Z scale_ub=None, 2025-05-07T20:32:43.2665399Z contiguous=False, 2025-05-07T20:32:43.2665483Z compiled=False, 2025-05-07T20:32:43.2665551Z ) 2025-05-07T20:32:43.2665761Z self = 2025-05-07T20:32:43.2665928Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.2666034Z 2025-05-07T20:32:43.2670126Z @given( 2025-05-07T20:32:43.2670267Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2670365Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2670475Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2670584Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2670694Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2670765Z ) 2025-05-07T20:32:43.2671011Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2671100Z def test_silu_mul_quant( 2025-05-07T20:32:43.2671172Z self, 2025-05-07T20:32:43.2671247Z T: int, 2025-05-07T20:32:43.2671318Z D: int, 2025-05-07T20:32:43.2671409Z scale_ub: Optional[float], 2025-05-07T20:32:43.2671497Z contiguous: bool, 2025-05-07T20:32:43.2671576Z compiled: bool, 2025-05-07T20:32:43.2671657Z ) -> None: 2025-05-07T20:32:43.2671750Z torch.manual_seed(2025) 2025-05-07T20:32:43.2671818Z 2025-05-07T20:32:43.2671983Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2673731Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2673738Z 2025-05-07T20:32:43.2673850Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2673860Z 2025-05-07T20:32:43.2673953Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2674172Z self=, 2025-05-07T20:32:43.2674249Z T=4096, 2025-05-07T20:32:43.2674429Z D=7168, 2025-05-07T20:32:43.2674508Z scale_ub=None, 2025-05-07T20:32:43.2674589Z contiguous=True, 2025-05-07T20:32:43.2674668Z compiled=True, 2025-05-07T20:32:43.2674737Z ) 2025-05-07T20:32:43.2674948Z self = 2025-05-07T20:32:43.2675109Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.2675114Z 2025-05-07T20:32:43.2675190Z @given( 2025-05-07T20:32:43.2675301Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2675395Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2675510Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2675622Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2675730Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2675808Z ) 2025-05-07T20:32:43.2676044Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2676137Z def test_silu_mul_quant( 2025-05-07T20:32:43.2676216Z self, 2025-05-07T20:32:43.2676288Z T: int, 2025-05-07T20:32:43.2676357Z D: int, 2025-05-07T20:32:43.2676451Z scale_ub: Optional[float], 2025-05-07T20:32:43.2676535Z contiguous: bool, 2025-05-07T20:32:43.2676621Z compiled: bool, 2025-05-07T20:32:43.2676696Z ) -> None: 2025-05-07T20:32:43.2676785Z torch.manual_seed(2025) 2025-05-07T20:32:43.2676857Z 2025-05-07T20:32:43.2677016Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2678768Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2678858Z 2025-05-07T20:32:43.2678969Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2678974Z 2025-05-07T20:32:43.2679069Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2679290Z self=, 2025-05-07T20:32:43.2679366Z T=2048, 2025-05-07T20:32:43.2679441Z D=5120, 2025-05-07T20:32:43.2679523Z scale_ub=1200.0, 2025-05-07T20:32:43.2679601Z contiguous=False, 2025-05-07T20:32:43.2679683Z compiled=False, 2025-05-07T20:32:43.2679747Z ) 2025-05-07T20:32:43.2679955Z self = 2025-05-07T20:32:43.2680137Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:43.2680142Z 2025-05-07T20:32:43.2680216Z @given( 2025-05-07T20:32:43.2680330Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2680427Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2680534Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2680644Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2680754Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2680826Z ) 2025-05-07T20:32:43.2681063Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2681150Z def test_silu_mul_quant( 2025-05-07T20:32:43.2681219Z self, 2025-05-07T20:32:43.2681294Z T: int, 2025-05-07T20:32:43.2681365Z D: int, 2025-05-07T20:32:43.2681455Z scale_ub: Optional[float], 2025-05-07T20:32:43.2681540Z contiguous: bool, 2025-05-07T20:32:43.2681626Z compiled: bool, 2025-05-07T20:32:43.2681696Z ) -> None: 2025-05-07T20:32:43.2681787Z torch.manual_seed(2025) 2025-05-07T20:32:43.2681935Z 2025-05-07T20:32:43.2682097Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2683834Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2683840Z 2025-05-07T20:32:43.2683949Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2683961Z 2025-05-07T20:32:43.2684055Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2684278Z self=, 2025-05-07T20:32:43.2684349Z T=4096, 2025-05-07T20:32:43.2684422Z D=7168, 2025-05-07T20:32:43.2684499Z scale_ub=1200.0, 2025-05-07T20:32:43.2684582Z contiguous=True, 2025-05-07T20:32:43.2684661Z compiled=False, 2025-05-07T20:32:43.2684731Z ) 2025-05-07T20:32:43.2684946Z self = 2025-05-07T20:32:43.2685110Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.2685114Z 2025-05-07T20:32:43.2685188Z @given( 2025-05-07T20:32:43.2685298Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2685388Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2685499Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2685608Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2685798Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2685872Z ) 2025-05-07T20:32:43.2686109Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2686198Z def test_silu_mul_quant( 2025-05-07T20:32:43.2686273Z self, 2025-05-07T20:32:43.2686342Z T: int, 2025-05-07T20:32:43.2686413Z D: int, 2025-05-07T20:32:43.2686508Z scale_ub: Optional[float], 2025-05-07T20:32:43.2686593Z contiguous: bool, 2025-05-07T20:32:43.2686676Z compiled: bool, 2025-05-07T20:32:43.2686750Z ) -> None: 2025-05-07T20:32:43.2686838Z torch.manual_seed(2025) 2025-05-07T20:32:43.2686908Z 2025-05-07T20:32:43.2687064Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2688810Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2688824Z 2025-05-07T20:32:43.2688934Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2688939Z 2025-05-07T20:32:43.2689032Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2689250Z self=, 2025-05-07T20:32:43.2689325Z T=16384, 2025-05-07T20:32:43.2689395Z D=7168, 2025-05-07T20:32:43.2689477Z scale_ub=None, 2025-05-07T20:32:43.2689558Z contiguous=False, 2025-05-07T20:32:43.2689640Z compiled=True, 2025-05-07T20:32:43.2689710Z ) 2025-05-07T20:32:43.2689923Z self = 2025-05-07T20:32:43.2690175Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:43.2690180Z 2025-05-07T20:32:43.2690253Z @given( 2025-05-07T20:32:43.2690363Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2690458Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2690566Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2690676Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2690788Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2690857Z ) 2025-05-07T20:32:43.2691095Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2691181Z def test_silu_mul_quant( 2025-05-07T20:32:43.2691252Z self, 2025-05-07T20:32:43.2691327Z T: int, 2025-05-07T20:32:43.2691398Z D: int, 2025-05-07T20:32:43.2691493Z scale_ub: Optional[float], 2025-05-07T20:32:43.2691584Z contiguous: bool, 2025-05-07T20:32:43.2691668Z compiled: bool, 2025-05-07T20:32:43.2691746Z ) -> None: 2025-05-07T20:32:43.2691839Z torch.manual_seed(2025) 2025-05-07T20:32:43.2691905Z 2025-05-07T20:32:43.2692061Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2693805Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2693888Z 2025-05-07T20:32:43.2693998Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2694008Z 2025-05-07T20:32:43.2694101Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2694320Z self=, 2025-05-07T20:32:43.2694395Z T=4096, 2025-05-07T20:32:43.2694463Z D=7168, 2025-05-07T20:32:43.2694538Z scale_ub=None, 2025-05-07T20:32:43.2694618Z contiguous=True, 2025-05-07T20:32:43.2694697Z compiled=False, 2025-05-07T20:32:43.2694762Z ) 2025-05-07T20:32:43.2694975Z self = 2025-05-07T20:32:43.2695138Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.2695142Z 2025-05-07T20:32:43.2695215Z @given( 2025-05-07T20:32:43.2695330Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2695421Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2695531Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2695647Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2695752Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2695832Z ) 2025-05-07T20:32:43.2696066Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2696152Z def test_silu_mul_quant( 2025-05-07T20:32:43.2696228Z self, 2025-05-07T20:32:43.2696298Z T: int, 2025-05-07T20:32:43.2696369Z D: int, 2025-05-07T20:32:43.2696462Z scale_ub: Optional[float], 2025-05-07T20:32:43.2696543Z contiguous: bool, 2025-05-07T20:32:43.2696623Z compiled: bool, 2025-05-07T20:32:43.2696694Z ) -> None: 2025-05-07T20:32:43.2696782Z torch.manual_seed(2025) 2025-05-07T20:32:43.2696852Z 2025-05-07T20:32:43.2697011Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2698879Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2698893Z 2025-05-07T20:32:43.2699003Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2699008Z 2025-05-07T20:32:43.2699101Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2699319Z self=, 2025-05-07T20:32:43.2699392Z T=16384, 2025-05-07T20:32:43.2699462Z D=7168, 2025-05-07T20:32:43.2699547Z scale_ub=None, 2025-05-07T20:32:43.2699631Z contiguous=True, 2025-05-07T20:32:43.2699716Z compiled=False, 2025-05-07T20:32:43.2699787Z ) 2025-05-07T20:32:43.2699993Z self = 2025-05-07T20:32:43.2700167Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:32:43.2700171Z 2025-05-07T20:32:43.2700243Z @given( 2025-05-07T20:32:43.2700353Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2700447Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2700554Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2700661Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2700770Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2700837Z ) 2025-05-07T20:32:43.2701075Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2701161Z def test_silu_mul_quant( 2025-05-07T20:32:43.2701232Z self, 2025-05-07T20:32:43.2701411Z T: int, 2025-05-07T20:32:43.2701485Z D: int, 2025-05-07T20:32:43.2701576Z scale_ub: Optional[float], 2025-05-07T20:32:43.2701662Z contiguous: bool, 2025-05-07T20:32:43.2701747Z compiled: bool, 2025-05-07T20:32:43.2701822Z ) -> None: 2025-05-07T20:32:43.2701912Z torch.manual_seed(2025) 2025-05-07T20:32:43.2701979Z 2025-05-07T20:32:43.2702136Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2703877Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2703887Z 2025-05-07T20:32:43.2704006Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2704011Z 2025-05-07T20:32:43.2704109Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2704323Z self=, 2025-05-07T20:32:43.2704398Z T=16384, 2025-05-07T20:32:43.2704470Z D=7168, 2025-05-07T20:32:43.2704548Z scale_ub=1200.0, 2025-05-07T20:32:43.2704633Z contiguous=True, 2025-05-07T20:32:43.2704711Z compiled=False, 2025-05-07T20:32:43.2704777Z ) 2025-05-07T20:32:43.2704989Z self = 2025-05-07T20:32:43.2705156Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.2705160Z 2025-05-07T20:32:43.2705236Z @given( 2025-05-07T20:32:43.2705345Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2705434Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2705551Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2705740Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2705848Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2705919Z ) 2025-05-07T20:32:43.2706154Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2706242Z def test_silu_mul_quant( 2025-05-07T20:32:43.2706314Z self, 2025-05-07T20:32:43.2706384Z T: int, 2025-05-07T20:32:43.2706456Z D: int, 2025-05-07T20:32:43.2706546Z scale_ub: Optional[float], 2025-05-07T20:32:43.2706629Z contiguous: bool, 2025-05-07T20:32:43.2706712Z compiled: bool, 2025-05-07T20:32:43.2706786Z ) -> None: 2025-05-07T20:32:43.2706872Z torch.manual_seed(2025) 2025-05-07T20:32:43.2706944Z 2025-05-07T20:32:43.2707101Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2708971Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 13.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2708981Z 2025-05-07T20:32:43.2709091Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2709096Z 2025-05-07T20:32:43.2709193Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2709411Z self=, 2025-05-07T20:32:43.2709484Z T=128, 2025-05-07T20:32:43.2709556Z D=5120, 2025-05-07T20:32:43.2709718Z scale_ub=1200.0, 2025-05-07T20:32:43.2709798Z contiguous=False, 2025-05-07T20:32:43.2709877Z compiled=False, 2025-05-07T20:32:43.2709940Z ) 2025-05-07T20:32:43.2710155Z self = 2025-05-07T20:32:43.2710322Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:43.2710326Z 2025-05-07T20:32:43.2710399Z @given( 2025-05-07T20:32:43.2710508Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2710605Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2710712Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2710819Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2710930Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2711000Z ) 2025-05-07T20:32:43.2711241Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2711329Z def test_silu_mul_quant( 2025-05-07T20:32:43.2711405Z self, 2025-05-07T20:32:43.2711479Z T: int, 2025-05-07T20:32:43.2711551Z D: int, 2025-05-07T20:32:43.2711650Z scale_ub: Optional[float], 2025-05-07T20:32:43.2711737Z contiguous: bool, 2025-05-07T20:32:43.2711816Z compiled: bool, 2025-05-07T20:32:43.2711887Z ) -> None: 2025-05-07T20:32:43.2711979Z torch.manual_seed(2025) 2025-05-07T20:32:43.2712046Z 2025-05-07T20:32:43.2712208Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2712278Z 2025-05-07T20:32:43.2712363Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2712488Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2712572Z x = x_sign * x_clamp 2025-05-07T20:32:43.2712646Z x0 = x[:, :D] 2025-05-07T20:32:43.2712726Z x1 = x[:, D:] 2025-05-07T20:32:43.2712792Z 2025-05-07T20:32:43.2712869Z if contiguous: 2025-05-07T20:32:43.2712957Z x0 = x0.contiguous() 2025-05-07T20:32:43.2713041Z x1 = x1.contiguous() 2025-05-07T20:32:43.2713106Z 2025-05-07T20:32:43.2713195Z if scale_ub is not None: 2025-05-07T20:32:43.2713374Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2713503Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2713581Z ) 2025-05-07T20:32:43.2713652Z else: 2025-05-07T20:32:43.2713739Z scale_ub_tensor = None 2025-05-07T20:32:43.2713805Z 2025-05-07T20:32:43.2713927Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2714011Z op = silu_mul_quant 2025-05-07T20:32:43.2714089Z if compiled: 2025-05-07T20:32:43.2714180Z op = torch.compile(op) 2025-05-07T20:32:43.2714286Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2714351Z 2025-05-07T20:32:43.2714433Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2714437Z 2025-05-07T20:32:43.2714535Z moe/activation_test.py:117: 2025-05-07T20:32:43.2714656Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2714753Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2714847Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2715337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2715429Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2715782Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2715998Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2716336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2716422Z kernel = self.compile( 2025-05-07T20:32:43.2716822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2717071Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2717194Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2717199Z 2025-05-07T20:32:43.2717399Z self = 2025-05-07T20:32:43.2718159Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2718649Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4483600>} 2025-05-07T20:32:43.2719378Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2719570Z context = 2025-05-07T20:32:43.2719575Z 2025-05-07T20:32:43.2719733Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2719987Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2720094Z module_map=module_map) 2025-05-07T20:32:43.2720249Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2720340Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2720412Z E ^ 2025-05-07T20:32:43.2720758Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2720763Z 2025-05-07T20:32:43.2721174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2721183Z 2025-05-07T20:32:43.2721278Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2721565Z self=, 2025-05-07T20:32:43.2721645Z T=2048, 2025-05-07T20:32:43.2721717Z D=7168, 2025-05-07T20:32:43.2721790Z scale_ub=None, 2025-05-07T20:32:43.2721877Z contiguous=False, 2025-05-07T20:32:43.2721953Z compiled=False, 2025-05-07T20:32:43.2722021Z ) 2025-05-07T20:32:43.2722233Z self = 2025-05-07T20:32:43.2722399Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:43.2722404Z 2025-05-07T20:32:43.2722477Z @given( 2025-05-07T20:32:43.2722589Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2722683Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2722791Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2722908Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2723011Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2723082Z ) 2025-05-07T20:32:43.2723320Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2723406Z def test_silu_mul_quant( 2025-05-07T20:32:43.2723479Z self, 2025-05-07T20:32:43.2723547Z T: int, 2025-05-07T20:32:43.2723624Z D: int, 2025-05-07T20:32:43.2723715Z scale_ub: Optional[float], 2025-05-07T20:32:43.2723797Z contiguous: bool, 2025-05-07T20:32:43.2723877Z compiled: bool, 2025-05-07T20:32:43.2723949Z ) -> None: 2025-05-07T20:32:43.2724033Z torch.manual_seed(2025) 2025-05-07T20:32:43.2724100Z 2025-05-07T20:32:43.2724258Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2726020Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 30.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 5.24 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2726106Z 2025-05-07T20:32:43.2726217Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2726221Z 2025-05-07T20:32:43.2726314Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2726531Z self=, 2025-05-07T20:32:43.2726602Z T=128, 2025-05-07T20:32:43.2726674Z D=7168, 2025-05-07T20:32:43.2726750Z scale_ub=1200.0, 2025-05-07T20:32:43.2726823Z contiguous=True, 2025-05-07T20:32:43.2726901Z compiled=True, 2025-05-07T20:32:43.2726973Z ) 2025-05-07T20:32:43.2727180Z self = 2025-05-07T20:32:43.2727346Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.2727351Z 2025-05-07T20:32:43.2727425Z @given( 2025-05-07T20:32:43.2727534Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2727629Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2727734Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2727849Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2727954Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2728019Z ) 2025-05-07T20:32:43.2728258Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2728345Z def test_silu_mul_quant( 2025-05-07T20:32:43.2728412Z self, 2025-05-07T20:32:43.2728483Z T: int, 2025-05-07T20:32:43.2728560Z D: int, 2025-05-07T20:32:43.2728650Z scale_ub: Optional[float], 2025-05-07T20:32:43.2728738Z contiguous: bool, 2025-05-07T20:32:43.2728817Z compiled: bool, 2025-05-07T20:32:43.2728988Z ) -> None: 2025-05-07T20:32:43.2729078Z torch.manual_seed(2025) 2025-05-07T20:32:43.2729142Z 2025-05-07T20:32:43.2729302Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2729368Z 2025-05-07T20:32:43.2729450Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2729570Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2729651Z x = x_sign * x_clamp 2025-05-07T20:32:43.2729722Z x0 = x[:, :D] 2025-05-07T20:32:43.2729795Z x1 = x[:, D:] 2025-05-07T20:32:43.2729859Z 2025-05-07T20:32:43.2729933Z if contiguous: 2025-05-07T20:32:43.2730019Z x0 = x0.contiguous() 2025-05-07T20:32:43.2730099Z x1 = x1.contiguous() 2025-05-07T20:32:43.2730164Z 2025-05-07T20:32:43.2730253Z if scale_ub is not None: 2025-05-07T20:32:43.2730352Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:43.2730491Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:43.2730560Z ) 2025-05-07T20:32:43.2730628Z else: 2025-05-07T20:32:43.2730717Z scale_ub_tensor = None 2025-05-07T20:32:43.2730785Z 2025-05-07T20:32:43.2730911Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:43.2730995Z op = silu_mul_quant 2025-05-07T20:32:43.2731072Z if compiled: 2025-05-07T20:32:43.2731164Z op = torch.compile(op) 2025-05-07T20:32:43.2731263Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2731327Z 2025-05-07T20:32:43.2731411Z > y_fp8, y_scale = fn() 2025-05-07T20:32:43.2731416Z 2025-05-07T20:32:43.2731516Z moe/activation_test.py:117: 2025-05-07T20:32:43.2731639Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2731888Z moe/activation_test.py:115: in fn 2025-05-07T20:32:43.2731978Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:43.2732346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:43.2732435Z return fn(*args, **kwargs) 
2025-05-07T20:32:43.2732917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:43.2733007Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:43.2733360Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:43.2733573Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:43.2733905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:43.2733991Z kernel = self.compile( 2025-05-07T20:32:43.2734388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:43.2734565Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:43.2734682Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:43.2734687Z 2025-05-07T20:32:43.2734883Z self = 2025-05-07T20:32:43.2735640Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:43.2736126Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f13c4260900>} 2025-05-07T20:32:43.2736855Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:43.2737121Z context = 2025-05-07T20:32:43.2737126Z 2025-05-07T20:32:43.2737286Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:43.2737538Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:43.2737639Z module_map=module_map) 2025-05-07T20:32:43.2737795Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:43.2737884Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:43.2737956Z E ^ 2025-05-07T20:32:43.2738300Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:43.2738305Z 2025-05-07T20:32:43.2738713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:43.2738723Z 2025-05-07T20:32:43.2738827Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2739039Z self=, 2025-05-07T20:32:43.2739107Z T=128, 2025-05-07T20:32:43.2739183Z D=7168, 2025-05-07T20:32:43.2739259Z scale_ub=1200.0, 2025-05-07T20:32:43.2739337Z contiguous=True, 2025-05-07T20:32:43.2739413Z compiled=False, 2025-05-07T20:32:43.2739480Z ) 2025-05-07T20:32:43.2739691Z self = 2025-05-07T20:32:43.2739851Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:43.2739856Z 2025-05-07T20:32:43.2739924Z @given( 2025-05-07T20:32:43.2740042Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2740366Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2740614Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2740728Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2740840Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2740907Z ) 2025-05-07T20:32:43.2741143Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2741230Z def test_silu_mul_quant( 2025-05-07T20:32:43.2741299Z self, 2025-05-07T20:32:43.2741368Z T: int, 2025-05-07T20:32:43.2741437Z D: int, 2025-05-07T20:32:43.2741531Z scale_ub: Optional[float], 2025-05-07T20:32:43.2741612Z contiguous: bool, 2025-05-07T20:32:43.2741689Z compiled: bool, 2025-05-07T20:32:43.2741764Z ) -> None: 2025-05-07T20:32:43.2741853Z torch.manual_seed(2025) 2025-05-07T20:32:43.2741915Z 2025-05-07T20:32:43.2742078Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2742142Z 2025-05-07T20:32:43.2742236Z x_sign = torch.sign(x) 2025-05-07T20:32:43.2742351Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:43.2744099Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
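[NOTE] The compiled=True example a few entries above fails with the same fp8e4nv CompilationError: torch.compile only adds the torch/_dynamo/eval_frame.py frame to the traceback, while the Triton kernel is still JIT-compiled for the physical GPU with the same CUDAOptions, so the architecture check fails identically under compilation. A standalone repro sketch (the import path is inferred from the activation.py traceback and is an assumption, as are the shapes, which follow that T=128, D=7168, scale_ub=1200.0 example):

    import torch

    # Path inferred from .../fbgemm_gpu/experimental/gen_ai/moe/activation.py above.
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    x0 = torch.randn(128, 7168, device="cuda", dtype=torch.bfloat16)
    x1 = torch.randn(128, 7168, device="cuda", dtype=torch.bfloat16)
    scale_ub = torch.tensor([1200.0], device="cuda", dtype=torch.float32)

    op = torch.compile(silu_mul_quant)
    y_fp8, y_scale = op(x0, x1, scale_ub)  # raises CompilationError on pre-SM 8.9 GPUs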
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2744108Z 2025-05-07T20:32:43.2744218Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:32:43.2744222Z 2025-05-07T20:32:43.2744316Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2744535Z self=, 2025-05-07T20:32:43.2744609Z T=128, 2025-05-07T20:32:43.2744677Z D=5120, 2025-05-07T20:32:43.2744753Z scale_ub=1200.0, 2025-05-07T20:32:43.2744950Z contiguous=True, 2025-05-07T20:32:43.2745030Z compiled=True, 2025-05-07T20:32:43.2745098Z ) 2025-05-07T20:32:43.2745308Z self = 2025-05-07T20:32:43.2745473Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:43.2745477Z 2025-05-07T20:32:43.2745547Z @given( 2025-05-07T20:32:43.2745656Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2745750Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2745856Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2745964Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2746070Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2746138Z ) 2025-05-07T20:32:43.2746372Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2746466Z def test_silu_mul_quant( 2025-05-07T20:32:43.2746538Z self, 2025-05-07T20:32:43.2746618Z T: int, 2025-05-07T20:32:43.2746690Z D: int, 2025-05-07T20:32:43.2746778Z scale_ub: Optional[float], 2025-05-07T20:32:43.2746868Z contiguous: bool, 2025-05-07T20:32:43.2746948Z compiled: bool, 2025-05-07T20:32:43.2747015Z ) -> None: 2025-05-07T20:32:43.2747107Z torch.manual_seed(2025) 2025-05-07T20:32:43.2747174Z 2025-05-07T20:32:43.2747356Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2747482Z 2025-05-07T20:32:43.2747575Z > x_sign = torch.sign(x) 2025-05-07T20:32:43.2749319Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2749409Z 2025-05-07T20:32:43.2749519Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:32:43.2749524Z 2025-05-07T20:32:43.2749624Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:43.2749838Z self=, 2025-05-07T20:32:43.2749910Z T=128, 2025-05-07T20:32:43.2749981Z D=7168, 2025-05-07T20:32:43.2750052Z scale_ub=None, 2025-05-07T20:32:43.2750127Z contiguous=True, 2025-05-07T20:32:43.2750208Z compiled=True, 2025-05-07T20:32:43.2750272Z ) 2025-05-07T20:32:43.2750479Z self = 2025-05-07T20:32:43.2750646Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:43.2750651Z 2025-05-07T20:32:43.2750719Z @given( 2025-05-07T20:32:43.2750836Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:43.2750926Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:43.2751031Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:43.2751146Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:43.2751251Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:43.2751320Z ) 2025-05-07T20:32:43.2751557Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:43.2751644Z def test_silu_mul_quant( 2025-05-07T20:32:43.2751713Z self, 2025-05-07T20:32:43.2751788Z T: int, 2025-05-07T20:32:43.2751856Z D: int, 2025-05-07T20:32:43.2751944Z scale_ub: Optional[float], 2025-05-07T20:32:43.2752026Z contiguous: bool, 2025-05-07T20:32:43.2752106Z compiled: bool, 2025-05-07T20:32:43.2752176Z ) -> None: 2025-05-07T20:32:43.2752263Z torch.manual_seed(2025) 2025-05-07T20:32:43.2752329Z 2025-05-07T20:32:43.2752569Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:43.2754299Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 8.44 MiB is free. Including non-PyTorch memory, this process has 22.05 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 2.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:32:43.2754305Z 2025-05-07T20:32:43.2754418Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:32:43.2754551Z =============================== warnings summary =============================== 2025-05-07T20:32:43.2754856Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:43.2755149Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:43.2755436Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:32:43.2756301Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:32:43.2756520Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:32:43.2756524Z 2025-05-07T20:32:43.2756695Z experimental/gen_ai/test/moe/activation_test.py: 10 warnings 2025-05-07T20:32:43.2758014Z /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py:72: FutureWarning: `torch.testing.assert_allclose()` is deprecated since 1.12 and will be removed in a future release. Please use `torch.testing.assert_close()` instead. You can find detailed upgrade instructions in https://github.com/pytorch/pytorch/issues/61844. 2025-05-07T20:32:43.2758192Z torch.testing.assert_allclose(y, y_ref, rtol=1.6e-2, atol=1e-3) 2025-05-07T20:32:43.2758197Z 2025-05-07T20:32:43.2758403Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:32:43.2758558Z ================== 1 failed, 1 passed, 13 warnings in 18.99s =================== 2025-05-07T20:32:45.1015068Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --cache-clear ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:32:45.1641137Z 2025-05-07T20:32:45.1641868Z [TEST] Some tests FAILED. Re-attempting only FAILED tests: ./moe/activation_test.py 2025-05-07T20:32:45.1642248Z 2025-05-07T20:32:45.1642253Z 2025-05-07T20:32:45.1662082Z [EXEC] [ATTEMPT 0/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:32:47.3190622Z ============================= test session starts ============================== 2025-05-07T20:32:47.3191258Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:32:47.3191791Z cachedir: .pytest_cache 2025-05-07T20:32:47.3192379Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:32:47.3193103Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:32:47.3193495Z plugins: hypothesis-6.131.14 2025-05-07T20:32:48.8795495Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:32:48.9758010Z collecting ... 
collected 2 items / 1 deselected / 1 selected 2025-05-07T20:32:48.9758762Z run-last-failure: rerun previous 1 failure 2025-05-07T20:32:48.9758997Z 2025-05-07T20:32:50.8377792Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:50.8379439Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Traceback (most recent call last): 2025-05-07T20:32:50.8380802Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:50.8382268Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:50.8383272Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:50.8384554Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:50.8385912Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:50.8387194Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:50.8389082Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:50.8390122Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] module_map=module_map) 2025-05-07T20:32:50.8391421Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:50.8392645Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] generator.visit(fn.parse()) 2025-05-07T20:32:50.8393467Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:50.8394649Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:50.8395832Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ret = super().visit(node) 2025-05-07T20:32:50.8396840Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:50.8397837Z W0507 20:32:50.835000 88852 
site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] return visitor(node) 2025-05-07T20:32:50.8399024Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:50.8400427Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:50.8401309Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:50.8402371Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:50.8403394Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] self.visit(item) 2025-05-07T20:32:50.8404139Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:50.8405285Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:50.8406628Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:50.8407665Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:50.8408555Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:50.8409319Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ^ 2025-05-07T20:32:50.8410383Z W0507 20:32:50.835000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:50.8545628Z W0507 20:32:50.853000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:50.8577673Z W0507 20:32:50.853000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/0] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.2532121Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.2532965Z self=, 2025-05-07T20:32:51.2533415Z T=1, 2025-05-07T20:32:51.2533604Z D=5120, 2025-05-07T20:32:51.2533797Z scale_ub=None, 2025-05-07T20:32:51.2534377Z contiguous=True, 2025-05-07T20:32:51.2534611Z compiled=True, 2025-05-07T20:32:51.2534815Z ) 2025-05-07T20:32:51.2535143Z self = 2025-05-07T20:32:51.2535641Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:51.2535901Z 2025-05-07T20:32:51.2535981Z @given( 2025-05-07T20:32:51.2536213Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.2536527Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.2536833Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.2537162Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.2537481Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.2537765Z ) 2025-05-07T20:32:51.2538135Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.2538583Z def test_silu_mul_quant( 2025-05-07T20:32:51.2538831Z self, 2025-05-07T20:32:51.2539037Z T: int, 2025-05-07T20:32:51.2539251Z D: int, 2025-05-07T20:32:51.2539487Z scale_ub: Optional[float], 2025-05-07T20:32:51.2539757Z contiguous: bool, 2025-05-07T20:32:51.2539987Z compiled: bool, 2025-05-07T20:32:51.2540549Z ) -> None: 2025-05-07T20:32:51.2540767Z torch.manual_seed(2025) 2025-05-07T20:32:51.2541010Z 2025-05-07T20:32:51.2541279Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.2541626Z 2025-05-07T20:32:51.2541811Z x_sign = torch.sign(x) 2025-05-07T20:32:51.2542100Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.2542406Z x = x_sign * x_clamp 2025-05-07T20:32:51.2542640Z x0 = x[:, :D] 2025-05-07T20:32:51.2543043Z x1 = x[:, D:] 2025-05-07T20:32:51.2543249Z 2025-05-07T20:32:51.2543436Z if contiguous: 2025-05-07T20:32:51.2543661Z x0 = x0.contiguous() 2025-05-07T20:32:51.2543919Z x1 = x1.contiguous() 2025-05-07T20:32:51.2544164Z 2025-05-07T20:32:51.2544348Z if scale_ub is not None: 2025-05-07T20:32:51.2544616Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.2544950Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.2545255Z ) 2025-05-07T20:32:51.2545447Z else: 2025-05-07T20:32:51.2545655Z scale_ub_tensor = None 2025-05-07T20:32:51.2545894Z 2025-05-07T20:32:51.2546121Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.2546431Z op = silu_mul_quant 2025-05-07T20:32:51.2546670Z if compiled: 2025-05-07T20:32:51.2546915Z op = torch.compile(op) 2025-05-07T20:32:51.2547205Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.2547557Z 2025-05-07T20:32:51.2547742Z y_fp8, y_scale = fn() 2025-05-07T20:32:51.2548023Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:51.2548315Z 2025-05-07T20:32:51.2548540Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.2548875Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:51.2549161Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:51.2549463Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:51.2549818Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:51.2550120Z 2025-05-07T20:32:51.2550312Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:51.2550509Z 2025-05-07T20:32:51.2550607Z moe/activation_test.py:126: 2025-05-07T20:32:51.2550899Z _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.2551226Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:51.2551548Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:51.2552452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:51.2553209Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:51.2553741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.2554419Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.2555107Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:51.2555826Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:51.2556555Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:51.2557184Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:51.2557781Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:51.2558295Z fn() 2025-05-07T20:32:51.2558812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:51.2559380Z self.fn.run( 2025-05-07T20:32:51.2559845Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.2560355Z kernel = self.compile( 2025-05-07T20:32:51.2560909Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.2561575Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.2561967Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.2562188Z 2025-05-07T20:32:51.2562394Z self = 2025-05-07T20:32:51.2563556Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.2564918Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37b445b6a0>} 2025-05-07T20:32:51.2566270Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.2567316Z context = 2025-05-07T20:32:51.2567605Z 2025-05-07T20:32:51.2567767Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.2568284Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.2568748Z module_map=module_map) 2025-05-07T20:32:51.2569104Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.2569459Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:51.2569714Z E ^ 2025-05-07T20:32:51.2570167Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:51.2570627Z 2025-05-07T20:32:51.2571047Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:51.2571559Z 2025-05-07T20:32:51.2571659Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:51.2572070Z self=, 2025-05-07T20:32:51.2572459Z T=2048, 2025-05-07T20:32:51.2572649Z D=5120, 2025-05-07T20:32:51.2572850Z scale_ub=1200.0, 2025-05-07T20:32:51.2573064Z contiguous=True, 2025-05-07T20:32:51.2573285Z compiled=False, 2025-05-07T20:32:51.2573488Z ) 2025-05-07T20:32:51.2573890Z self = 2025-05-07T20:32:51.2574378Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:51.2574651Z 2025-05-07T20:32:51.2574727Z @given( 2025-05-07T20:32:51.2574953Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:51.2575254Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:51.2575554Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:51.2575878Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:51.2576192Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:51.2576472Z ) 2025-05-07T20:32:51.2576821Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:51.2577259Z def test_silu_mul_quant( 2025-05-07T20:32:51.2577508Z self, 2025-05-07T20:32:51.2577697Z T: int, 2025-05-07T20:32:51.2577892Z D: int, 2025-05-07T20:32:51.2578110Z scale_ub: Optional[float], 2025-05-07T20:32:51.2578381Z contiguous: bool, 2025-05-07T20:32:51.2578614Z compiled: bool, 2025-05-07T20:32:51.2578826Z ) -> None: 2025-05-07T20:32:51.2579041Z torch.manual_seed(2025) 2025-05-07T20:32:51.2579283Z 2025-05-07T20:32:51.2579542Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:51.2579878Z 2025-05-07T20:32:51.2580065Z x_sign = torch.sign(x) 2025-05-07T20:32:51.2580346Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:51.2580654Z x = x_sign * x_clamp 2025-05-07T20:32:51.2580896Z x0 = x[:, :D] 2025-05-07T20:32:51.2581102Z x1 = x[:, D:] 2025-05-07T20:32:51.2581306Z 2025-05-07T20:32:51.2581484Z if contiguous: 2025-05-07T20:32:51.2581703Z x0 = x0.contiguous() 2025-05-07T20:32:51.2582042Z x1 = x1.contiguous() 2025-05-07T20:32:51.2582275Z 2025-05-07T20:32:51.2582452Z if scale_ub is not None: 2025-05-07T20:32:51.2582726Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:51.2583053Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:51.2583360Z ) 2025-05-07T20:32:51.2583548Z else: 2025-05-07T20:32:51.2583757Z scale_ub_tensor = None 2025-05-07T20:32:51.2584006Z 2025-05-07T20:32:51.2584229Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:51.2584535Z op = silu_mul_quant 2025-05-07T20:32:51.2584784Z if compiled: 2025-05-07T20:32:51.2585022Z op = torch.compile(op) 2025-05-07T20:32:51.2585314Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.2585579Z 2025-05-07T20:32:51.2585762Z > y_fp8, y_scale = fn() 2025-05-07T20:32:51.2585928Z 2025-05-07T20:32:51.2586030Z moe/activation_test.py:117: 2025-05-07T20:32:51.2586321Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.2586651Z moe/activation_test.py:115: in fn 2025-05-07T20:32:51.2586924Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:51.2587785Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:51.2588732Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:51.2589278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:51.2589949Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:51.2590603Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:51.2591123Z kernel = self.compile( 2025-05-07T20:32:51.2591649Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:51.2592321Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:51.2592840Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:51.2593066Z 2025-05-07T20:32:51.2593269Z self = 2025-05-07T20:32:51.2594331Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:51.2595681Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37b40c1f80>} 2025-05-07T20:32:51.2597038Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:51.2598103Z context = 2025-05-07T20:32:51.2598383Z 2025-05-07T20:32:51.2598544Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:51.2599058Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:51.2599530Z module_map=module_map) 2025-05-07T20:32:51.2599885Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:51.2600224Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:51.2600481Z E ^ 2025-05-07T20:32:51.2600941Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:51.2601803Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:51.6506766Z W0507 20:32:51.646000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:51.6538811Z W0507 20:32:51.646000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:51.7276616Z W0507 20:32:51.724000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:51.7308400Z W0507 20:32:51.724000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/1] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.3063535Z 2025-05-07T20:32:52.3063870Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.3064332Z self=, 2025-05-07T20:32:52.3064737Z T=2048, 2025-05-07T20:32:52.3064981Z D=5120, 2025-05-07T20:32:52.3065246Z scale_ub=1200.0, 2025-05-07T20:32:52.3065542Z contiguous=True, 2025-05-07T20:32:52.3066071Z compiled=True, 2025-05-07T20:32:52.3066272Z ) 2025-05-07T20:32:52.3066592Z self = 2025-05-07T20:32:52.3067090Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:52.3067355Z 2025-05-07T20:32:52.3067509Z @given( 2025-05-07T20:32:52.3067735Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.3068048Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.3068342Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.3068666Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.3068985Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.3069264Z ) 2025-05-07T20:32:52.3069624Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.3070081Z def test_silu_mul_quant( 2025-05-07T20:32:52.3070321Z self, 2025-05-07T20:32:52.3070502Z T: int, 2025-05-07T20:32:52.3070703Z D: int, 2025-05-07T20:32:52.3070918Z scale_ub: Optional[float], 2025-05-07T20:32:52.3071180Z contiguous: bool, 2025-05-07T20:32:52.3071420Z compiled: bool, 2025-05-07T20:32:52.3071643Z ) -> None: 2025-05-07T20:32:52.3071847Z torch.manual_seed(2025) 2025-05-07T20:32:52.3072086Z 2025-05-07T20:32:52.3072386Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.3072736Z 2025-05-07T20:32:52.3072931Z x_sign = torch.sign(x) 2025-05-07T20:32:52.3073214Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.3073526Z x = x_sign * x_clamp 2025-05-07T20:32:52.3073766Z x0 = x[:, :D] 2025-05-07T20:32:52.3073974Z x1 = x[:, D:] 2025-05-07T20:32:52.3074181Z 2025-05-07T20:32:52.3074361Z if contiguous: 2025-05-07T20:32:52.3074579Z x0 = x0.contiguous() 2025-05-07T20:32:52.3074832Z x1 = x1.contiguous() 2025-05-07T20:32:52.3075070Z 2025-05-07T20:32:52.3075249Z if scale_ub is not None: 2025-05-07T20:32:52.3075524Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.3076012Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.3076316Z ) 2025-05-07T20:32:52.3076501Z else: 2025-05-07T20:32:52.3076708Z scale_ub_tensor = None 2025-05-07T20:32:52.3076955Z 2025-05-07T20:32:52.3077173Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.3077489Z op = silu_mul_quant 2025-05-07T20:32:52.3077734Z if compiled: 2025-05-07T20:32:52.3077973Z op = torch.compile(op) 2025-05-07T20:32:52.3078263Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3078535Z 2025-05-07T20:32:52.3078717Z y_fp8, y_scale = fn() 2025-05-07T20:32:52.3078996Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:52.3079280Z 2025-05-07T20:32:52.3079507Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.3079837Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:52.3080121Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:52.3080436Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:52.3080783Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:52.3081085Z 2025-05-07T20:32:52.3081280Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:52.3081469Z 2025-05-07T20:32:52.3081565Z moe/activation_test.py:126: 2025-05-07T20:32:52.3081857Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3082186Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:52.3082509Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:52.3083292Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:52.3084132Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:52.3084670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.3085340Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.3086026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:52.3086739Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:52.3087466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:52.3088087Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:52.3088695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:52.3089206Z fn() 2025-05-07T20:32:52.3089761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:52.3090361Z self.fn.run( 2025-05-07T20:32:52.3090841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.3091360Z kernel = self.compile( 2025-05-07T20:32:52.3091892Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.3092561Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.3092945Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3093165Z 2025-05-07T20:32:52.3093374Z self = 2025-05-07T20:32:52.3094500Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.3096074Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37b41191c0>} 2025-05-07T20:32:52.3097398Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.3098461Z context = 2025-05-07T20:32:52.3098747Z 2025-05-07T20:32:52.3098909Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.3099431Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.3099898Z module_map=module_map) 2025-05-07T20:32:52.3100268Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.3100617Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:52.3100885Z E ^ 2025-05-07T20:32:52.3101348Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:52.3101792Z 2025-05-07T20:32:52.3102216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:52.3102723Z 2025-05-07T20:32:52.3102824Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:52.3103231Z self=, 2025-05-07T20:32:52.3103632Z T=16384, 2025-05-07T20:32:52.3103816Z D=7168, 2025-05-07T20:32:52.3104010Z scale_ub=1200.0, 2025-05-07T20:32:52.3104232Z contiguous=False, 2025-05-07T20:32:52.3104452Z compiled=False, 2025-05-07T20:32:52.3104654Z ) 2025-05-07T20:32:52.3104965Z self = 2025-05-07T20:32:52.3105532Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:52.3105823Z 2025-05-07T20:32:52.3105898Z @given( 2025-05-07T20:32:52.3106133Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:52.3106447Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:52.3106744Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:52.3107076Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:52.3107475Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:52.3107755Z ) 2025-05-07T20:32:52.3108103Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:52.3108550Z def test_silu_mul_quant( 2025-05-07T20:32:52.3108785Z self, 2025-05-07T20:32:52.3108994Z T: int, 2025-05-07T20:32:52.3109486Z D: int, 2025-05-07T20:32:52.3109784Z scale_ub: Optional[float], 2025-05-07T20:32:52.3110085Z contiguous: bool, 2025-05-07T20:32:52.3110551Z compiled: bool, 2025-05-07T20:32:52.3117233Z ) -> None: 2025-05-07T20:32:52.3117488Z torch.manual_seed(2025) 2025-05-07T20:32:52.3117743Z 2025-05-07T20:32:52.3118014Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:52.3118356Z 2025-05-07T20:32:52.3118540Z x_sign = torch.sign(x) 2025-05-07T20:32:52.3118827Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:52.3119140Z x = x_sign * x_clamp 2025-05-07T20:32:52.3119381Z x0 = x[:, :D] 2025-05-07T20:32:52.3119618Z x1 = x[:, D:] 2025-05-07T20:32:52.3119847Z 2025-05-07T20:32:52.3120032Z if contiguous: 2025-05-07T20:32:52.3120258Z x0 = x0.contiguous() 2025-05-07T20:32:52.3120519Z x1 = x1.contiguous() 2025-05-07T20:32:52.3120763Z 2025-05-07T20:32:52.3120960Z if scale_ub is not None: 2025-05-07T20:32:52.3121228Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:52.3121567Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:52.3121882Z ) 2025-05-07T20:32:52.3122063Z else: 2025-05-07T20:32:52.3122381Z scale_ub_tensor = None 2025-05-07T20:32:52.3122633Z 2025-05-07T20:32:52.3122860Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:52.3123166Z op = silu_mul_quant 2025-05-07T20:32:52.3123412Z if compiled: 2025-05-07T20:32:52.3123649Z op = torch.compile(op) 2025-05-07T20:32:52.3123940Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3124214Z 2025-05-07T20:32:52.3124397Z > y_fp8, y_scale = fn() 2025-05-07T20:32:52.3124565Z 2025-05-07T20:32:52.3124662Z moe/activation_test.py:117: 2025-05-07T20:32:52.3124954Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3125277Z moe/activation_test.py:115: in fn 2025-05-07T20:32:52.3125550Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:52.3126232Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:52.3126917Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:52.3127451Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:52.3128114Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:52.3128770Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:52.3129291Z kernel = self.compile( 2025-05-07T20:32:52.3129840Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:52.3130486Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:52.3130875Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:52.3131183Z 2025-05-07T20:32:52.3131393Z self = 2025-05-07T20:32:52.3132468Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:52.3133820Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37b42aa980>} 2025-05-07T20:32:52.3135145Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:52.3136206Z context = 2025-05-07T20:32:52.3136486Z 2025-05-07T20:32:52.3136654Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:52.3137178Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:52.3137644Z module_map=module_map) 2025-05-07T20:32:52.3138004Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:52.3138352Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:52.3138610Z E ^ 2025-05-07T20:32:52.3139070Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:52.3139938Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:52.5385628Z W0507 20:32:52.534000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:52.5417196Z W0507 20:32:52.534000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:52.5927932Z W0507 20:32:52.589000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:52.5959468Z W0507 20:32:52.589000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/2] ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.0441617Z 2025-05-07T20:32:53.0441885Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.0442307Z self=, 2025-05-07T20:32:53.0442723Z T=1, 2025-05-07T20:32:53.0442919Z D=7168, 2025-05-07T20:32:53.0443115Z scale_ub=None, 2025-05-07T20:32:53.0443328Z contiguous=True, 2025-05-07T20:32:53.0443585Z compiled=True, 2025-05-07T20:32:53.0443859Z ) 2025-05-07T20:32:53.0444206Z self = 2025-05-07T20:32:53.0444688Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:53.0444957Z 2025-05-07T20:32:53.0445032Z @given( 2025-05-07T20:32:53.0445254Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.0445769Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.0446077Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.0446418Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.0446733Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.0447019Z ) 2025-05-07T20:32:53.0447364Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.0447801Z def test_silu_mul_quant( 2025-05-07T20:32:53.0448047Z self, 2025-05-07T20:32:53.0448241Z T: int, 2025-05-07T20:32:53.0448437Z D: int, 2025-05-07T20:32:53.0448659Z scale_ub: Optional[float], 2025-05-07T20:32:53.0448922Z contiguous: bool, 2025-05-07T20:32:53.0449159Z compiled: bool, 2025-05-07T20:32:53.0449376Z ) -> None: 2025-05-07T20:32:53.0449605Z torch.manual_seed(2025) 2025-05-07T20:32:53.0449842Z 2025-05-07T20:32:53.0450110Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.0450466Z 2025-05-07T20:32:53.0450653Z x_sign = torch.sign(x) 2025-05-07T20:32:53.0450939Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.0451254Z x = x_sign * x_clamp 2025-05-07T20:32:53.0451488Z x0 = x[:, :D] 2025-05-07T20:32:53.0451702Z x1 = x[:, D:] 2025-05-07T20:32:53.0451911Z 2025-05-07T20:32:53.0452091Z if contiguous: 2025-05-07T20:32:53.0452311Z x0 = x0.contiguous() 2025-05-07T20:32:53.0452568Z x1 = x1.contiguous() 2025-05-07T20:32:53.0452804Z 2025-05-07T20:32:53.0452987Z if scale_ub is not None: 2025-05-07T20:32:53.0453256Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:53.0453589Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:53.0454021Z ) 2025-05-07T20:32:53.0454209Z else: 2025-05-07T20:32:53.0454416Z scale_ub_tensor = None 2025-05-07T20:32:53.0454667Z 2025-05-07T20:32:53.0454894Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.0455197Z op = silu_mul_quant 2025-05-07T20:32:53.0455446Z if compiled: 2025-05-07T20:32:53.0455686Z op = torch.compile(op) 2025-05-07T20:32:53.0455984Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.0456262Z 2025-05-07T20:32:53.0456455Z y_fp8, y_scale = fn() 2025-05-07T20:32:53.0456736Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:53.0457023Z 2025-05-07T20:32:53.0457249Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.0457573Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:53.0457858Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:53.0458164Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:53.0458516Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:53.0458823Z 2025-05-07T20:32:53.0459031Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:53.0459228Z 2025-05-07T20:32:53.0459327Z moe/activation_test.py:126: 2025-05-07T20:32:53.0459616Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.0459946Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:53.0460266Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:53.0461184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:53.0462107Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:53.0462764Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:53.0463570Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:53.0464358Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:53.0465288Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:53.0466151Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:53.0466947Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:53.0467714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:53.0468331Z fn() 2025-05-07T20:32:53.0468996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:53.0469654Z self.fn.run( 2025-05-07T20:32:53.0470181Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:53.0470870Z kernel = self.compile( 2025-05-07T20:32:53.0471512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:53.0472223Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.0472788Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.0473074Z 2025-05-07T20:32:53.0473309Z self = 2025-05-07T20:32:53.0474542Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:53.0476137Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37aec76520>} 2025-05-07T20:32:53.0477607Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:53.0478710Z context = 2025-05-07T20:32:53.0479120Z 2025-05-07T20:32:53.0479313Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:53.0479913Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.0480443Z module_map=module_map) 2025-05-07T20:32:53.0480953Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.0481411Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:53.0481709Z E ^ 2025-05-07T20:32:53.0482310Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.0482885Z 2025-05-07T20:32:53.0483362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:53.0483902Z 2025-05-07T20:32:53.0484101Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:53.0484575Z self=, 2025-05-07T20:32:53.0485081Z T=4096, 2025-05-07T20:32:53.0485393Z D=5120, 2025-05-07T20:32:53.0485685Z scale_ub=None, 2025-05-07T20:32:53.0485975Z contiguous=False, 2025-05-07T20:32:53.0486324Z compiled=False, 2025-05-07T20:32:53.0486646Z ) 2025-05-07T20:32:53.0487019Z self = 2025-05-07T20:32:53.0487630Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:53.0487943Z 2025-05-07T20:32:53.0488108Z @given( 2025-05-07T20:32:53.0488432Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:53.0488876Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:53.0489302Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:53.0489796Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:53.0490216Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:53.0490624Z ) 2025-05-07T20:32:53.0491053Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:53.0491576Z def test_silu_mul_quant( 2025-05-07T20:32:53.0491935Z self, 2025-05-07T20:32:53.0492216Z T: int, 2025-05-07T20:32:53.0492537Z D: int, 2025-05-07T20:32:53.0492893Z scale_ub: Optional[float], 2025-05-07T20:32:53.0493248Z contiguous: bool, 2025-05-07T20:32:53.0493644Z compiled: bool, 2025-05-07T20:32:53.0493922Z ) -> None: 2025-05-07T20:32:53.0494220Z torch.manual_seed(2025) 2025-05-07T20:32:53.0494612Z 2025-05-07T20:32:53.0494935Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:53.0495386Z 2025-05-07T20:32:53.0495711Z x_sign = torch.sign(x) 2025-05-07T20:32:53.0496056Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:53.0496469Z x = x_sign * x_clamp 2025-05-07T20:32:53.0496844Z x0 = x[:, :D] 2025-05-07T20:32:53.0497191Z x1 = x[:, D:] 2025-05-07T20:32:53.0497471Z 2025-05-07T20:32:53.0497788Z if contiguous: 2025-05-07T20:32:53.0498103Z x0 = x0.contiguous() 2025-05-07T20:32:53.0498435Z x1 = x1.contiguous() 2025-05-07T20:32:53.0498807Z 2025-05-07T20:32:53.0499105Z if scale_ub is not None: 2025-05-07T20:32:53.0499429Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:53.0499894Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:53.0500313Z ) 2025-05-07T20:32:53.0500558Z else: 2025-05-07T20:32:53.0500943Z scale_ub_tensor = None 2025-05-07T20:32:53.0501391Z 2025-05-07T20:32:53.0501671Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:53.0502121Z op = silu_mul_quant 2025-05-07T20:32:53.0502482Z if compiled: 2025-05-07T20:32:53.0502802Z op = torch.compile(op) 2025-05-07T20:32:53.0503235Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.0503604Z 2025-05-07T20:32:53.0503867Z > y_fp8, y_scale = fn() 2025-05-07T20:32:53.0504111Z 2025-05-07T20:32:53.0504272Z moe/activation_test.py:117: 2025-05-07T20:32:53.0504650Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.0505046Z moe/activation_test.py:115: in fn 2025-05-07T20:32:53.0505545Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:53.0506299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:53.0507063Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:53.0507830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:53.0508561Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:53.0509290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:53.0509990Z kernel = self.compile( 2025-05-07T20:32:53.0510631Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:53.0511357Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.0511902Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:53.0512186Z 2025-05-07T20:32:53.0512418Z self = 2025-05-07T20:32:53.0513722Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:53.0515269Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37aec77f60>} 2025-05-07T20:32:53.0516666Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:53.0517797Z context = 2025-05-07T20:32:53.0518153Z 2025-05-07T20:32:53.0518347Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:53.0518966Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.0519522Z module_map=module_map) 2025-05-07T20:32:53.0520043Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.0520501Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:53.0520850Z E ^ 2025-05-07T20:32:53.0521426Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.0521933Z 2025-05-07T20:32:53.0522427Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:53.3339676Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:53.3341038Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Traceback (most recent call last): 2025-05-07T20:32:53.3342522Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:53.3344186Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:53.3345261Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:53.3346669Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:53.3348207Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.3349594Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:53.3351160Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.3352287Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] module_map=module_map) 2025-05-07T20:32:53.3353671Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:53.3355112Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] generator.visit(fn.parse()) 2025-05-07T20:32:53.3356159Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:53.3357390Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:53.3358730Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ret = super().visit(node) 2025-05-07T20:32:53.3359836Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File 
"/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:53.3360908Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return visitor(node) 2025-05-07T20:32:53.3362298Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:53.3363645Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:53.3364588Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:53.3365804Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:53.3366967Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] self.visit(item) 2025-05-07T20:32:53.3367983Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:53.3369230Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:53.3370675Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:53.3371844Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.3372839Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:53.3373673Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^ 2025-05-07T20:32:53.3374763Z W0507 20:32:53.330000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:53.5144731Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Encountered an exception in identify_mutated_tensors, assuming every input is mutated 2025-05-07T20:32:53.5145999Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] Traceback (most recent call last): 2025-05-07T20:32:53.5147386Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors 2025-05-07T20:32:53.5148968Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs) 2025-05-07T20:32:53.5150352Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^ 2025-05-07T20:32:53.5151751Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir 2025-05-07T20:32:53.5153235Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ttir_module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:53.5154588Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:32:53.5156129Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:53.5157307Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] module_map=module_map) 2025-05-07T20:32:53.5158665Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir 2025-05-07T20:32:53.5160045Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] generator.visit(fn.parse()) 2025-05-07T20:32:53.5160968Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:53.5162393Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit 2025-05-07T20:32:53.5163654Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ret = super().visit(node) 2025-05-07T20:32:53.5164832Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit 2025-05-07T20:32:53.5165911Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] return visitor(node) 
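The W0507 blocks tagged [0/2], [0/3], and so on are non-fatal: torch.compile tries to prove which arguments of a user-defined Triton kernel are mutated by regenerating the kernel's TTIR via generate_ttir, and when that raises (here, the same fp8e4nv CompilationError) it logs the traceback and conservatively assumes every input is mutated, which is safe but disables the optimization. A minimal sketch of that fallback shape, with hypothetical names, not the torch._higher_order_ops implementation:

from typing import Any, Callable, Dict, List

import torch

def identify_mutated_tensors_sketch(
    analyze_ir: Callable[[Dict[str, Any]], List[str]],
    kernel_kwargs: Dict[str, Any],
) -> List[str]:
    try:
        # Precise path: build the kernel IR and look for stores.
        return analyze_ir(kernel_kwargs)
    except Exception:
        # Logged path: "Encountered an exception in identify_mutated_tensors,
        # assuming every input is mutated" -- safe but pessimistic.
        return [k for k, v in kernel_kwargs.items() if isinstance(v, torch.Tensor)]

The hard failures that actually fail test_silu_mul_quant come instead from the eager launches in fn and ref_fn below, where the CompilationError propagates out of triton.runtime.jit rather than being swallowed.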
2025-05-07T20:32:53.5167150Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module 2025-05-07T20:32:53.5168564Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ast.NodeVisitor.generic_visit(self, node) 2025-05-07T20:32:53.5169590Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:32:53.5170770Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit 2025-05-07T20:32:53.5171954Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] self.visit(item) 2025-05-07T20:32:53.5172758Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ~~~~~~~~~~^^^^^^ 2025-05-07T20:32:53.5174072Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit 2025-05-07T20:32:53.5175549Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None 2025-05-07T20:32:53.5176687Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:53.5177691Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] def _fbgemm_silu_mul_quant( 2025-05-07T20:32:53.5178484Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ^ 2025-05-07T20:32:53.5179654Z W0507 20:32:53.510000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/3] ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.0456098Z 2025-05-07T20:32:54.0456721Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.0457345Z self=, 2025-05-07T20:32:54.0465011Z T=4096, 2025-05-07T20:32:54.0465244Z D=7168, 2025-05-07T20:32:54.0465435Z scale_ub=None, 2025-05-07T20:32:54.0465643Z contiguous=False, 2025-05-07T20:32:54.0465877Z compiled=False, 2025-05-07T20:32:54.0466090Z ) 2025-05-07T20:32:54.0466410Z self = 2025-05-07T20:32:54.0467000Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:54.0467287Z 2025-05-07T20:32:54.0467375Z @given( 2025-05-07T20:32:54.0467657Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.0467980Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.0468282Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.0468809Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.0469130Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.0469407Z ) 2025-05-07T20:32:54.0469749Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.0470196Z def test_silu_mul_quant( 2025-05-07T20:32:54.0470446Z self, 2025-05-07T20:32:54.0470629Z T: int, 2025-05-07T20:32:54.0470820Z D: int, 2025-05-07T20:32:54.0471034Z scale_ub: Optional[float], 2025-05-07T20:32:54.0471303Z contiguous: bool, 2025-05-07T20:32:54.0471537Z compiled: bool, 2025-05-07T20:32:54.0471790Z ) -> None: 2025-05-07T20:32:54.0471999Z torch.manual_seed(2025) 2025-05-07T20:32:54.0472237Z 2025-05-07T20:32:54.0472501Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.0472833Z 2025-05-07T20:32:54.0473017Z x_sign = torch.sign(x) 2025-05-07T20:32:54.0473298Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.0473594Z x = x_sign * x_clamp 2025-05-07T20:32:54.0473834Z x0 = x[:, :D] 2025-05-07T20:32:54.0474043Z x1 = x[:, D:] 2025-05-07T20:32:54.0474238Z 2025-05-07T20:32:54.0474420Z if contiguous: 2025-05-07T20:32:54.0474643Z x0 = x0.contiguous() 2025-05-07T20:32:54.0474888Z x1 = x1.contiguous() 2025-05-07T20:32:54.0475117Z 2025-05-07T20:32:54.0475298Z if scale_ub is not None: 2025-05-07T20:32:54.0475559Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.0475889Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.0476182Z ) 2025-05-07T20:32:54.0476360Z else: 2025-05-07T20:32:54.0476561Z scale_ub_tensor = None 2025-05-07T20:32:54.0476806Z 2025-05-07T20:32:54.0477023Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.0477330Z op = silu_mul_quant 2025-05-07T20:32:54.0477577Z if compiled: 2025-05-07T20:32:54.0477812Z op = torch.compile(op) 2025-05-07T20:32:54.0478226Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.0478495Z 2025-05-07T20:32:54.0478673Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.0478836Z 2025-05-07T20:32:54.0478932Z moe/activation_test.py:117: 2025-05-07T20:32:54.0479219Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.0479542Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.0479808Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.0480501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.0481176Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.0481719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 
2025-05-07T20:32:54.0482582Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.0483248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.0483770Z kernel = self.compile( 2025-05-07T20:32:54.0484319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.0484961Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.0485358Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.0485579Z 2025-05-07T20:32:54.0485787Z self = 2025-05-07T20:32:54.0486857Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.0488374Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37aec76ca0>} 2025-05-07T20:32:54.0489695Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.0490749Z context = 2025-05-07T20:32:54.0491028Z 2025-05-07T20:32:54.0491196Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.0491711Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.0492169Z module_map=module_map) 2025-05-07T20:32:54.0492534Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.0492879Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.0493133Z E ^ 2025-05-07T20:32:54.0493596Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.0494040Z 2025-05-07T20:32:54.0494470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.0494984Z 2025-05-07T20:32:54.0495081Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.0495481Z self=, 2025-05-07T20:32:54.0495872Z T=128, 2025-05-07T20:32:54.0496045Z D=7168, 2025-05-07T20:32:54.0496235Z scale_ub=None, 2025-05-07T20:32:54.0496443Z contiguous=False, 2025-05-07T20:32:54.0496657Z compiled=True, 2025-05-07T20:32:54.0496850Z ) 2025-05-07T20:32:54.0497161Z self = 2025-05-07T20:32:54.0497650Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:32:54.0497922Z 2025-05-07T20:32:54.0497997Z @given( 2025-05-07T20:32:54.0498303Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.0498606Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.0498894Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.0499215Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.0499530Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.0499798Z ) 2025-05-07T20:32:54.0500137Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.0500572Z def test_silu_mul_quant( 2025-05-07T20:32:54.0500801Z self, 2025-05-07T20:32:54.0500986Z T: int, 2025-05-07T20:32:54.0501174Z D: int, 2025-05-07T20:32:54.0501376Z scale_ub: Optional[float], 2025-05-07T20:32:54.0501638Z contiguous: bool, 2025-05-07T20:32:54.0501873Z compiled: bool, 2025-05-07T20:32:54.0502093Z ) -> None: 2025-05-07T20:32:54.0502295Z torch.manual_seed(2025) 2025-05-07T20:32:54.0502534Z 2025-05-07T20:32:54.0502808Z x = 
torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.0503138Z 2025-05-07T20:32:54.0503328Z x_sign = torch.sign(x) 2025-05-07T20:32:54.0503614Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.0503913Z x = x_sign * x_clamp 2025-05-07T20:32:54.0504149Z x0 = x[:, :D] 2025-05-07T20:32:54.0504364Z x1 = x[:, D:] 2025-05-07T20:32:54.0504558Z 2025-05-07T20:32:54.0504739Z if contiguous: 2025-05-07T20:32:54.0504969Z x0 = x0.contiguous() 2025-05-07T20:32:54.0505210Z x1 = x1.contiguous() 2025-05-07T20:32:54.0505443Z 2025-05-07T20:32:54.0505628Z if scale_ub is not None: 2025-05-07T20:32:54.0505888Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.0506219Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.0506619Z ) 2025-05-07T20:32:54.0506806Z else: 2025-05-07T20:32:54.0507006Z scale_ub_tensor = None 2025-05-07T20:32:54.0507246Z 2025-05-07T20:32:54.0507510Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.0507816Z op = silu_mul_quant 2025-05-07T20:32:54.0508056Z if compiled: 2025-05-07T20:32:54.0508296Z op = torch.compile(op) 2025-05-07T20:32:54.0508584Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.0508844Z 2025-05-07T20:32:54.0509024Z y_fp8, y_scale = fn() 2025-05-07T20:32:54.0509295Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:54.0509573Z 2025-05-07T20:32:54.0509812Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.0510129Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:54.0510412Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:54.0510723Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:54.0511074Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.0511376Z 2025-05-07T20:32:54.0511571Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:32:54.0511758Z 2025-05-07T20:32:54.0511856Z moe/activation_test.py:126: 2025-05-07T20:32:54.0512139Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.0512468Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:32:54.0512785Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.0513579Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:32:54.0514315Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:32:54.0514851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.0515527Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.0516294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:32:54.0517003Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:32:54.0517735Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:32:54.0518361Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:32:54.0518964Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:32:54.0519469Z fn() 2025-05-07T20:32:54.0519985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:32:54.0520571Z self.fn.run( 2025-05-07T20:32:54.0521038Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.0521582Z kernel = self.compile( 2025-05-07T20:32:54.0522142Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.0522805Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.0523200Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.0523428Z 2025-05-07T20:32:54.0523634Z self = 2025-05-07T20:32:54.0524702Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.0526052Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37ae368180>} 2025-05-07T20:32:54.0527482Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.0528491Z context = 2025-05-07T20:32:54.0528773Z 2025-05-07T20:32:54.0528946Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.0529455Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.0529925Z module_map=module_map) 2025-05-07T20:32:54.0530287Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.0530640Z E def _kernel_quantize_fp8_row( 2025-05-07T20:32:54.0530894Z E ^ 2025-05-07T20:32:54.0531345Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.0531795Z 2025-05-07T20:32:54.0532233Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2908150Z 2025-05-07T20:32:54.2908426Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2908850Z self=, 2025-05-07T20:32:54.2909477Z T=128, 2025-05-07T20:32:54.2909730Z D=7168, 2025-05-07T20:32:54.2910000Z scale_ub=None, 2025-05-07T20:32:54.2910303Z contiguous=False, 2025-05-07T20:32:54.2910559Z compiled=False, 2025-05-07T20:32:54.2910760Z ) 2025-05-07T20:32:54.2911075Z self = 2025-05-07T20:32:54.2911550Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:32:54.2911825Z 2025-05-07T20:32:54.2911902Z @given( 2025-05-07T20:32:54.2912140Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2912440Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2912943Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2913275Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2913594Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2913876Z ) 2025-05-07T20:32:54.2914222Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2914653Z def test_silu_mul_quant( 2025-05-07T20:32:54.2914886Z self, 2025-05-07T20:32:54.2915074Z T: int, 2025-05-07T20:32:54.2915272Z D: int, 2025-05-07T20:32:54.2915477Z scale_ub: Optional[float], 2025-05-07T20:32:54.2915746Z contiguous: bool, 2025-05-07T20:32:54.2915979Z compiled: bool, 2025-05-07T20:32:54.2916190Z ) -> None: 2025-05-07T20:32:54.2916397Z torch.manual_seed(2025) 2025-05-07T20:32:54.2916637Z 2025-05-07T20:32:54.2916896Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2917231Z 2025-05-07T20:32:54.2917429Z x_sign = torch.sign(x) 
2025-05-07T20:32:54.2917708Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2918018Z x = x_sign * x_clamp 2025-05-07T20:32:54.2918258Z x0 = x[:, :D] 2025-05-07T20:32:54.2918461Z x1 = x[:, D:] 2025-05-07T20:32:54.2918662Z 2025-05-07T20:32:54.2918838Z if contiguous: 2025-05-07T20:32:54.2919056Z x0 = x0.contiguous() 2025-05-07T20:32:54.2919304Z x1 = x1.contiguous() 2025-05-07T20:32:54.2919540Z 2025-05-07T20:32:54.2919731Z if scale_ub is not None: 2025-05-07T20:32:54.2920019Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2920374Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2920673Z ) 2025-05-07T20:32:54.2920855Z else: 2025-05-07T20:32:54.2921184Z scale_ub_tensor = None 2025-05-07T20:32:54.2921431Z 2025-05-07T20:32:54.2921653Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2921968Z op = silu_mul_quant 2025-05-07T20:32:54.2922213Z if compiled: 2025-05-07T20:32:54.2922449Z op = torch.compile(op) 2025-05-07T20:32:54.2922742Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2923010Z 2025-05-07T20:32:54.2923190Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2923353Z 2025-05-07T20:32:54.2923448Z moe/activation_test.py:117: 2025-05-07T20:32:54.2923733Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2924057Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2924327Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2925008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.2925693Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.2926223Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2926894Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2927568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2928096Z kernel = self.compile( 2025-05-07T20:32:54.2928645Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2929291Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2929682Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2929904Z 2025-05-07T20:32:54.2930118Z self = 2025-05-07T20:32:54.2931259Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2932618Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f37ae36b100>} 2025-05-07T20:32:54.2933933Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2934986Z context = 2025-05-07T20:32:54.2935263Z 2025-05-07T20:32:54.2935422Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2935940Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2936399Z module_map=module_map) 2025-05-07T20:32:54.2936752Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2937098Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.2937350Z E ^ 2025-05-07T20:32:54.2937808Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2938270Z 2025-05-07T20:32:54.2938678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2939187Z 2025-05-07T20:32:54.2939290Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2939696Z self=, 2025-05-07T20:32:54.2940351Z T=4096, 2025-05-07T20:32:54.2940530Z D=5120, 2025-05-07T20:32:54.2940715Z scale_ub=1200.0, 2025-05-07T20:32:54.2940935Z contiguous=True, 2025-05-07T20:32:54.2941271Z compiled=False, 2025-05-07T20:32:54.2941472Z ) 2025-05-07T20:32:54.2941788Z self = 2025-05-07T20:32:54.2942269Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:54.2942536Z 2025-05-07T20:32:54.2942611Z @given( 2025-05-07T20:32:54.2942827Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2943131Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2943421Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2943747Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2944063Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2944332Z ) 2025-05-07T20:32:54.2944676Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2945110Z def test_silu_mul_quant( 2025-05-07T20:32:54.2945342Z self, 2025-05-07T20:32:54.2945534Z T: int, 2025-05-07T20:32:54.2945737Z D: int, 2025-05-07T20:32:54.2945942Z scale_ub: Optional[float], 2025-05-07T20:32:54.2946212Z contiguous: bool, 2025-05-07T20:32:54.2946451Z compiled: bool, 2025-05-07T20:32:54.2946665Z ) -> None: 2025-05-07T20:32:54.2946878Z torch.manual_seed(2025) 2025-05-07T20:32:54.2947115Z 2025-05-07T20:32:54.2947380Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2947766Z 2025-05-07T20:32:54.2947956Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2948243Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2948544Z x = x_sign * x_clamp 2025-05-07T20:32:54.2948780Z x0 = x[:, :D] 2025-05-07T20:32:54.2948999Z x1 = x[:, D:] 2025-05-07T20:32:54.2949198Z 2025-05-07T20:32:54.2949377Z if contiguous: 2025-05-07T20:32:54.2949601Z x0 = x0.contiguous() 2025-05-07T20:32:54.2949849Z x1 = x1.contiguous() 2025-05-07T20:32:54.2950087Z 2025-05-07T20:32:54.2950268Z if scale_ub is not None: 2025-05-07T20:32:54.2950532Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2951005Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2951314Z ) 2025-05-07T20:32:54.2951493Z else: 2025-05-07T20:32:54.2951701Z scale_ub_tensor = None 2025-05-07T20:32:54.2951948Z 2025-05-07T20:32:54.2952166Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2952471Z op = silu_mul_quant 2025-05-07T20:32:54.2952717Z if compiled: 
2025-05-07T20:32:54.2952966Z op = torch.compile(op) 2025-05-07T20:32:54.2953254Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2953523Z 2025-05-07T20:32:54.2953713Z > y_fp8, y_scale = fn() 2025-05-07T20:32:54.2953872Z 2025-05-07T20:32:54.2953967Z moe/activation_test.py:117: 2025-05-07T20:32:54.2954262Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2954592Z moe/activation_test.py:115: in fn 2025-05-07T20:32:54.2954860Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2955570Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:54.2956244Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:54.2956774Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:54.2957434Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:54.2958103Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:54.2958625Z kernel = self.compile( 2025-05-07T20:32:54.2959161Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:54.2959888Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:54.2960278Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:54.2960507Z 2025-05-07T20:32:54.2960715Z self = 2025-05-07T20:32:54.2961819Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:54.2963174Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37ae1b1f80>} 2025-05-07T20:32:54.2964526Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:54.2965589Z context = 2025-05-07T20:32:54.2965868Z 2025-05-07T20:32:54.2966045Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:54.2966555Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:54.2967071Z module_map=module_map) 2025-05-07T20:32:54.2967427Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:54.2967768Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:54.2968021Z E ^ 2025-05-07T20:32:54.2968473Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:54.2968930Z 2025-05-07T20:32:54.2969353Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:54.2969854Z 2025-05-07T20:32:54.2969961Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:54.2970364Z self=, 2025-05-07T20:32:54.2970769Z T=1, 2025-05-07T20:32:54.2971019Z D=5120, 2025-05-07T20:32:54.2971208Z scale_ub=None, 2025-05-07T20:32:54.2971415Z contiguous=True, 2025-05-07T20:32:54.2971633Z compiled=True, 2025-05-07T20:32:54.2971826Z ) 2025-05-07T20:32:54.2972140Z self = 2025-05-07T20:32:54.2972617Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:32:54.2972872Z 2025-05-07T20:32:54.2972951Z @given( 2025-05-07T20:32:54.2973176Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:54.2973482Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:54.2973780Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:54.2974123Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:54.2974448Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:54.2974735Z ) 2025-05-07T20:32:54.2975070Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:54.2975523Z def test_silu_mul_quant( 2025-05-07T20:32:54.2975759Z self, 2025-05-07T20:32:54.2975944Z T: int, 2025-05-07T20:32:54.2976127Z D: int, 2025-05-07T20:32:54.2976344Z scale_ub: Optional[float], 2025-05-07T20:32:54.2976609Z contiguous: bool, 2025-05-07T20:32:54.2976835Z compiled: bool, 2025-05-07T20:32:54.2977047Z ) -> None: 2025-05-07T20:32:54.2977254Z torch.manual_seed(2025) 2025-05-07T20:32:54.2977480Z 2025-05-07T20:32:54.2977740Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:54.2978077Z 2025-05-07T20:32:54.2978258Z x_sign = torch.sign(x) 2025-05-07T20:32:54.2978545Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:54.2978851Z x = x_sign * x_clamp 2025-05-07T20:32:54.2979192Z x0 = x[:, :D] 2025-05-07T20:32:54.2979409Z x1 = x[:, D:] 2025-05-07T20:32:54.2979614Z 2025-05-07T20:32:54.2979795Z if contiguous: 2025-05-07T20:32:54.2980042Z x0 = x0.contiguous() 2025-05-07T20:32:54.2980325Z x1 = x1.contiguous() 2025-05-07T20:32:54.2986667Z 2025-05-07T20:32:54.2986929Z if scale_ub is not None: 2025-05-07T20:32:54.2987202Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:54.2987589Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:54.2987882Z ) 2025-05-07T20:32:54.2988065Z else: 2025-05-07T20:32:54.2988279Z scale_ub_tensor = None 2025-05-07T20:32:54.2988527Z 2025-05-07T20:32:54.2988754Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2989069Z op = silu_mul_quant 2025-05-07T20:32:54.2989319Z if compiled: 2025-05-07T20:32:54.2989564Z op = torch.compile(op) 2025-05-07T20:32:54.2989876Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:54.2990185Z 2025-05-07T20:32:54.2990375Z y_fp8, y_scale = fn() 2025-05-07T20:32:54.2990667Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:32:54.2990964Z 2025-05-07T20:32:54.2991200Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:54.2991528Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:32:54.2991828Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:32:54.2992136Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:32:54.2992481Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:32:54.2992784Z 2025-05-07T20:32:54.2992987Z > y_fp8_ref, 
y_scale_ref = ref_fn()
2025-05-07T20:32:54.2993178Z 
2025-05-07T20:32:54.2993274Z moe/activation_test.py:126: 
2025-05-07T20:32:54.2993571Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:54.2993906Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:54.2994230Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:54.2995121Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:54.2995892Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:54.2996448Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:54.2997139Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:54.2997835Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:54.2998544Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:54.2999276Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:54.2999903Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:54.3000504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:54.3001009Z     fn()
2025-05-07T20:32:54.3001531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:54.3002113Z     self.fn.run(
2025-05-07T20:32:54.3002572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:54.3003086Z     kernel = self.compile(
2025-05-07T20:32:54.3003615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:54.3004278Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:54.3004665Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:32:54.3004973Z 
2025-05-07T20:32:54.3005186Z self = 
2025-05-07T20:32:54.3006253Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:54.3007602Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37ae35e520>}
2025-05-07T20:32:54.3008914Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:54.3009993Z context = 
2025-05-07T20:32:54.3010297Z 
2025-05-07T20:32:54.3010474Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:54.3010985Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:54.3011454Z                           module_map=module_map)
2025-05-07T20:32:54.3011822Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:54.3012170Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:54.3012430Z E       ^
2025-05-07T20:32:54.3012891Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:54.3013353Z 
2025-05-07T20:32:54.3013767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:54.5237690Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
2025-05-07T20:32:54.5238750Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] Traceback (most recent call last):
2025-05-07T20:32:54.5240518Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 731, in identify_mutated_tensors
2025-05-07T20:32:54.5241952Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]     ttir_module, ordered_tensor_names = generate_ttir(kernel, kwargs)
2025-05-07T20:32:54.5242917Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]                                         ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^
2025-05-07T20:32:54.5244193Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_higher_order_ops/triton_kernel_wrap.py", line 356, in generate_ttir
2025-05-07T20:32:54.5245590Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]     ttir_module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:54.5246881Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir
2025-05-07T20:32:54.5248241Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:54.5249278Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]                        module_map=module_map)
2025-05-07T20:32:54.5250525Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1298, in ast_to_ttir
2025-05-07T20:32:54.5251924Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]     generator.visit(fn.parse())
2025-05-07T20:32:54.5252741Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]     ~~~~~~~~~~~~~~~^^^^^^^^^^^^
2025-05-07T20:32:54.5253934Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1201, in visit
2025-05-07T20:32:54.5255129Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]     ret = super().visit(node)
2025-05-07T20:32:54.5256145Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 428, in visit
2025-05-07T20:32:54.5257150Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]     return visitor(node)
2025-05-07T20:32:54.5258351Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 352, in visit_Module
2025-05-07T20:32:54.5259625Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]     ast.NodeVisitor.generic_visit(self, node)
2025-05-07T20:32:54.5260572Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
2025-05-07T20:32:54.5261647Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/ast.py", line 436, in generic_visit
2025-05-07T20:32:54.5262747Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]     self.visit(item)
2025-05-07T20:32:54.5263513Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]     ~~~~~~~~~~^^^^^^
2025-05-07T20:32:54.5264660Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/code_generator.py", line 1207, in visit
2025-05-07T20:32:54.5265998Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4]     raise CompilationError(self.jit_fn.src, self.cur_node, repr(e)) from None
2025-05-07T20:32:54.5267031Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:54.5267997Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] def _fbgemm_silu_mul_quant(
2025-05-07T20:32:54.5268725Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ^
2025-05-07T20:32:54.5269724Z W0507 20:32:54.520000 88852 site-packages/torch/_higher_order_ops/triton_kernel_wrap.py:752] [0/4] ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
[three further identify_mutated_tensors warnings elided (torch.compile retries [0/4]-[0/5], 20:32:54.580000-20:32:55.136000): each repeats the traceback above and ends in the same fp8e4nv ValueError for _fbgemm_silu_mul_quant]
2025-05-07T20:32:55.4119066Z 
2025-05-07T20:32:55.4119428Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:55.4120037Z     self=,
2025-05-07T20:32:55.4120649Z     T=2048,
2025-05-07T20:32:55.4120932Z     D=5120,
2025-05-07T20:32:55.4121198Z     scale_ub=None,
2025-05-07T20:32:55.4121478Z     contiguous=True,
2025-05-07T20:32:55.4121782Z     compiled=True,
2025-05-07T20:32:55.4122063Z )
2025-05-07T20:32:55.4122422Z self = 
2025-05-07T20:32:55.4122931Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:55.4123204Z 
2025-05-07T20:32:55.4123299Z     @given(
2025-05-07T20:32:55.4123529Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:55.4123847Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:55.4124152Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:55.4124489Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:55.4124809Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:55.4125121Z     )
2025-05-07T20:32:55.4125478Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:55.4126122Z     def test_silu_mul_quant(
2025-05-07T20:32:55.4126374Z         self,
2025-05-07T20:32:55.4126588Z         T: int,
2025-05-07T20:32:55.4126782Z         D: int,
2025-05-07T20:32:55.4127006Z         scale_ub: Optional[float],
2025-05-07T20:32:55.4127295Z         contiguous: bool,
2025-05-07T20:32:55.4127528Z         compiled: bool,
2025-05-07T20:32:55.4127763Z     ) -> None:
2025-05-07T20:32:55.4127996Z         torch.manual_seed(2025)
2025-05-07T20:32:55.4128238Z 
2025-05-07T20:32:55.4128527Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:55.4128883Z 
2025-05-07T20:32:55.4129076Z         x_sign = torch.sign(x)
2025-05-07T20:32:55.4129372Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:55.4129695Z         x = x_sign * x_clamp
2025-05-07T20:32:55.4129945Z         x0 = x[:, :D]
2025-05-07T20:32:55.4130181Z         x1 = x[:, D:]
2025-05-07T20:32:55.4130427Z 
2025-05-07T20:32:55.4130616Z         if contiguous:
2025-05-07T20:32:55.4130847Z             x0 = x0.contiguous()
2025-05-07T20:32:55.4131100Z             x1 = x1.contiguous()
2025-05-07T20:32:55.4131341Z 
2025-05-07T20:32:55.4131528Z         if scale_ub is not None:
2025-05-07T20:32:55.4131798Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:55.4132133Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:55.4132439Z             )
2025-05-07T20:32:55.4132637Z         else:
2025-05-07T20:32:55.4132845Z             scale_ub_tensor = None
2025-05-07T20:32:55.4133087Z 
2025-05-07T20:32:55.4133322Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:55.4133633Z             op = silu_mul_quant
2025-05-07T20:32:55.4133880Z             if compiled:
2025-05-07T20:32:55.4134134Z                 op = torch.compile(op)
2025-05-07T20:32:55.4134437Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:55.4134832Z 
2025-05-07T20:32:55.4135021Z         y_fp8, y_scale = fn()
2025-05-07T20:32:55.4135315Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:55.4135604Z 
2025-05-07T20:32:55.4135835Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:55.4136161Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:55.4136449Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:55.4136752Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:55.4137106Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:55.4137410Z 
2025-05-07T20:32:55.4137598Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:55.4137791Z 
2025-05-07T20:32:55.4137892Z moe/activation_test.py:126: 
2025-05-07T20:32:55.4138185Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[traceback elided: identical to the first failure above; triton_quantize_fp8_row again fails to compile _kernel_quantize_fp8_row with ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
2025-05-07T20:32:55.4158211Z 
2025-05-07T20:32:55.4158628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
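The failing reference path needs neither hypothesis nor torch.compile to reproduce; a standalone sketch, assuming the same fbgemm_gpu build is importable (shapes taken from the T=2048, D=5120 example above):

import torch
from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import (
    triton_quantize_fp8_row,
)

# Row-wise FP8 quantization of the silu(x0) * x1 reference output, as in ref_fn().
y = torch.randn(2048, 5120, device="cuda", dtype=torch.float32)
# scale_ub is optional; the failing examples pass None.
y_fp8, y_scale = triton_quantize_fp8_row(y, None)  # raises CompilationError on sm_86

On sm_89+ the call returns the quantized rows plus one scale per row, which is exactly what the test multiplies back out as y_fp8.to(torch.float32) * y_scale[:, None].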
2025-05-07T20:32:55.4159135Z 
2025-05-07T20:32:55.4159235Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:55.4159639Z     self=,
2025-05-07T20:32:55.4160030Z     T=128,
2025-05-07T20:32:55.4160244Z     D=5120,
2025-05-07T20:32:55.4160458Z     scale_ub=None,
2025-05-07T20:32:55.4160658Z     contiguous=True,
2025-05-07T20:32:55.4160878Z     compiled=True,
2025-05-07T20:32:55.4161087Z )
[source listing and traceback elided: identical to the T=2048 example above except T=128; same CompilationError in _kernel_quantize_fp8_row]
2025-05-07T20:32:55.4202826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
[four further identify_mutated_tensors warnings elided (torch.compile retries [0/6]-[0/7], 20:32:55.643000-20:32:56.309000): same traceback and fp8e4nv ValueError for _fbgemm_silu_mul_quant as above]
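Given that every drawn example dies in the same compile step, one containment option is to skip the FP8 cases up front on GPUs without fp8e4nv; a hypothetical unittest-style guard (the class name and wiring here are illustrative, not the actual moe/activation_test.py code):

import unittest

import torch

def _fp8e4nv_supported() -> bool:
    # Same capability probe as above: fp8e4nv requires sm_89+.
    return (
        torch.cuda.is_available()
        and torch.cuda.get_device_capability() >= (8, 9)
    )

@unittest.skipIf(
    not _fp8e4nv_supported(),
    "fp8e4nv needs sm_89+ (Ada/Hopper); this runner's A10G is sm_86",
)
class SiluMulQuantFP8Test(unittest.TestCase):  # hypothetical wrapper class
    ...

On runners like this one, that would turn the repeated CompilationError walls into a single skip line per test.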
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:56.6200932Z 
2025-05-07T20:32:56.6201423Z Trying example: test_silu_mul_quant(
2025-05-07T20:32:56.6202468Z     self=,
2025-05-07T20:32:56.6203339Z     T=4096,
2025-05-07T20:32:56.6203716Z     D=5120,
2025-05-07T20:32:56.6204456Z     scale_ub=None,
2025-05-07T20:32:56.6204861Z     contiguous=True,
2025-05-07T20:32:56.6205293Z     compiled=True,
2025-05-07T20:32:56.6205680Z )
2025-05-07T20:32:56.6206300Z self = 
2025-05-07T20:32:56.6207283Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = True
2025-05-07T20:32:56.6207826Z 
2025-05-07T20:32:56.6207990Z     @given(
2025-05-07T20:32:56.6208443Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:32:56.6209042Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:32:56.6209644Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:32:56.6210297Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:32:56.6210654Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:32:56.6210938Z     )
2025-05-07T20:32:56.6211284Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:32:56.6211733Z     def test_silu_mul_quant(
2025-05-07T20:32:56.6211979Z         self,
2025-05-07T20:32:56.6212174Z         T: int,
2025-05-07T20:32:56.6212362Z         D: int,
2025-05-07T20:32:56.6212576Z         scale_ub: Optional[float],
2025-05-07T20:32:56.6212843Z         contiguous: bool,
2025-05-07T20:32:56.6213073Z         compiled: bool,
2025-05-07T20:32:56.6213289Z     ) -> None:
2025-05-07T20:32:56.6213503Z         torch.manual_seed(2025)
2025-05-07T20:32:56.6213753Z 
2025-05-07T20:32:56.6214012Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:32:56.6214351Z 
2025-05-07T20:32:56.6214542Z         x_sign = torch.sign(x)
2025-05-07T20:32:56.6214825Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:32:56.6215141Z         x = x_sign * x_clamp
2025-05-07T20:32:56.6215381Z         x0 = x[:, :D]
2025-05-07T20:32:56.6215584Z         x1 = x[:, D:]
2025-05-07T20:32:56.6215803Z 
2025-05-07T20:32:56.6215997Z         if contiguous:
2025-05-07T20:32:56.6216230Z             x0 = x0.contiguous()
2025-05-07T20:32:56.6216484Z             x1 = x1.contiguous()
2025-05-07T20:32:56.6216720Z 
2025-05-07T20:32:56.6216916Z         if scale_ub is not None:
2025-05-07T20:32:56.6217306Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:32:56.6217650Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:32:56.6217967Z             )
2025-05-07T20:32:56.6218163Z         else:
2025-05-07T20:32:56.6218371Z             scale_ub_tensor = None
2025-05-07T20:32:56.6218632Z 
2025-05-07T20:32:56.6218868Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:56.6219176Z             op = silu_mul_quant
2025-05-07T20:32:56.6219423Z             if compiled:
2025-05-07T20:32:56.6219672Z                 op = torch.compile(op)
2025-05-07T20:32:56.6219961Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:56.6220234Z 
2025-05-07T20:32:56.6220431Z         y_fp8, y_scale = fn()
2025-05-07T20:32:56.6220704Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:32:56.6220996Z 
2025-05-07T20:32:56.6221228Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:32:56.6221556Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:32:56.6221842Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:32:56.6222148Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:32:56.6222500Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:56.6222802Z 
2025-05-07T20:32:56.6223003Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:56.6223195Z 
2025-05-07T20:32:56.6223301Z moe/activation_test.py:126:
2025-05-07T20:32:56.6223588Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:56.6223915Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:56.6224240Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:56.6225014Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:56.6225856Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:56.6226404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:32:56.6227078Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:32:56.6227831Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:32:56.6228543Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:32:56.6229266Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:32:56.6229894Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:32:56.6230486Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:32:56.6230999Z     fn()
2025-05-07T20:32:56.6231518Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:32:56.6232100Z     self.fn.run(
2025-05-07T20:32:56.6232565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:32:56.6233082Z     kernel = self.compile(
2025-05-07T20:32:56.6233631Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:32:56.6234266Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:32:56.6234659Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:32:56.6234882Z 
2025-05-07T20:32:56.6235090Z self = 
2025-05-07T20:32:56.6236163Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:32:56.6237604Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3788e26de0>}
2025-05-07T20:32:56.6238919Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:32:56.6239975Z context = 
2025-05-07T20:32:56.6240425Z 
2025-05-07T20:32:56.6240595Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:32:56.6241103Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:32:56.6241576Z                            module_map=module_map)
2025-05-07T20:32:56.6241941Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:56.6242298Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:56.6242561Z E       ^
2025-05-07T20:32:56.6243023Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:56.6243478Z 
2025-05-07T20:32:56.6243920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
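Note on the root cause: every example in this run dies at the same Triton compilation step. Both kernels involved (_kernel_quantize_fp8_row and _fbgemm_silu_mul_quant) emit fp8e4nv, Triton's name for the e4m3 format behind torch.float8_e4m3fn, and Triton only provides that type on GPUs of compute capability 8.9 and newer (Ada/Hopper). The NVIDIA A10G in this g5 runner class is capability 8.6, where only fp8e4b15 and fp8e5 exist, hence the ValueError. A minimal capability guard along these lines would skip the sweep on pre-Ada runners (a sketch, not FBGEMM code; the helper and class names are invented):

    import unittest

    import torch


    def gpu_supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (torch.float8_e4m3fn) needs SM 8.9+; an A10G
        # reports (8, 6) and only offers fp8e4b15 / fp8e5.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


    @unittest.skipUnless(gpu_supports_fp8e4nv(), "Triton fp8e4nv requires SM 8.9+")
    class Fp8ActivationTests(unittest.TestCase):
        ...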
2025-05-07T20:32:56.6244422Z 
2025-05-07T20:32:56.6244529Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:32:56.6261449Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:56.6261743Z moe/activation_test.py:126:
2025-05-07T20:32:56.6262359Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:56.6262683Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:56.6263481Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:56.6264228Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:56.6280361Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:56.6280713Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:56.6280974Z E       ^
2025-05-07T20:32:56.6281445Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:56.6281914Z 
2025-05-07T20:32:56.6282337Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:56.6478383Z W0507 20:32:56.646000 88852 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:32:56.6479682Z W0507 20:32:56.646000 88852 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:32:56.6481555Z W0507 20:32:56.646000 88852 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:32:56.6483822Z W0507 20:32:56.646000 88852 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:32:56.6486104Z W0507 20:32:56.646000 88852 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
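The warning above is a side effect of sweeping shapes and layouts through torch.compile: dynamo guards on input shapes and strides, each new (T, D, contiguous) combination fails a guard (here x0's row stride flipping between 5120 and 10240), and after recompile_limit (8) recompiles it falls back to eager for that frame. That is why later compiled=True examples still reach the raw Triton kernel. Two usual mitigations for an intentional shape sweep (a sketch, not part of the test):

    import torch

    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    # Raise the cap when a test deliberately sweeps many shapes...
    torch._dynamo.config.recompile_limit = 64

    # ...or compile once with dynamic shapes so new sizes and strides
    # reuse one graph instead of accumulating guard failures:
    compiled_op = torch.compile(silu_mul_quant, dynamic=True)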
2025-05-07T20:32:57.0742213Z 
2025-05-07T20:32:57.0742598Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:57.0764035Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:57.0764296Z moe/activation_test.py:117:
2025-05-07T20:32:57.0764897Z moe/activation_test.py:115: in fn
2025-05-07T20:32:57.0765169Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:57.0765733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:57.0766411Z     return fn(*args, **kwargs)
2025-05-07T20:32:57.0767063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:57.0767733Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:57.0778956Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:57.0779300Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:57.0779550Z E       ^
2025-05-07T20:32:57.0780009Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:57.0780458Z 
2025-05-07T20:32:57.0780861Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:57.0781370Z 
2025-05-07T20:32:57.0781468Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:57.0797968Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:32:57.0798250Z moe/activation_test.py:126:
2025-05-07T20:32:57.0798857Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:32:57.0799177Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:32:57.0799945Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:32:57.0800725Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:32:57.0816445Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:57.0816794Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:32:57.0817051Z E       ^
2025-05-07T20:32:57.0817500Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:57.0817948Z 
2025-05-07T20:32:57.0818375Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
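The stride numbers in the recompile warning follow directly from how the test builds its inputs: x0 = x[:, :D] is a view into a [T, 2*D] tensor, so it keeps the parent's row stride 2*D, while .contiguous() repacks it to row stride D. A standalone check for D=5120:

    import torch

    x = torch.randn(4, 2 * 5120)
    x0 = x[:, :5120]
    print(x0.stride())               # (10240, 1): view into the wide parent
    print(x0.contiguous().stride())  # (5120, 1): freshly packed copy

So the contiguous=True/False sweep alternates x0 between exactly the two layouts ("expected 5120, actual 10240") that dynamo reports.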
2025-05-07T20:32:57.2261990Z 
2025-05-07T20:32:57.2262872Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:57.2286074Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:57.2286338Z moe/activation_test.py:117:
2025-05-07T20:32:57.2287066Z moe/activation_test.py:115: in fn
2025-05-07T20:32:57.2287344Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:57.2288042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:57.2288714Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:57.2299734Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:57.2300078Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:57.2300329Z E       ^
2025-05-07T20:32:57.2300782Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:57.2301251Z 
2025-05-07T20:32:57.2301657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
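The compiled=False examples, like the one above, show torch.compile is not the trigger: silu_mul_quant launches the Triton kernel _fbgemm_silu_mul_quant directly, so the eager path hits the same fp8e4nv error. A standalone repro (a sketch, assuming this fbgemm_gpu build and a pre-SM-8.9 GPU):

    import torch

    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    x0 = torch.randn(128, 5120, device="cuda", dtype=torch.bfloat16)
    x1 = torch.randn(128, 5120, device="cuda", dtype=torch.bfloat16)

    # On an A10G (SM 8.6) this raises triton.compiler.errors.CompilationError
    # wrapping ValueError("type fp8e4nv not supported in this architecture. ...").
    y_fp8, y_scale = silu_mul_quant(x0, x1, None)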
2025-05-07T20:32:57.2302167Z 
2025-05-07T20:32:57.2302266Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:57.2316218Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:57.2316476Z moe/activation_test.py:117:
2025-05-07T20:32:57.2317071Z moe/activation_test.py:115: in fn
2025-05-07T20:32:57.2317344Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:57.2317906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:57.2318447Z     return fn(*args, **kwargs)
2025-05-07T20:32:57.2319087Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:57.2319759Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:57.2330854Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:57.2331193Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:57.2331485Z E       ^
2025-05-07T20:32:57.2331944Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:57.2332390Z 
2025-05-07T20:32:57.2332820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:57.2333320Z 
2025-05-07T20:32:57.2333425Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:57.3911251Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:57.3911519Z moe/activation_test.py:117:
2025-05-07T20:32:57.3912138Z moe/activation_test.py:115: in fn
2025-05-07T20:32:57.3912442Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:57.3913183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:57.3913978Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:57.3925144Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:57.3925491Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:57.3925747Z E       ^
2025-05-07T20:32:57.3926203Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:57.3926647Z 
2025-05-07T20:32:57.3927078Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:57.3927585Z 
2025-05-07T20:32:57.3927687Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:57.3941969Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:57.3942229Z moe/activation_test.py:117:
2025-05-07T20:32:57.3942831Z moe/activation_test.py:115: in fn
2025-05-07T20:32:57.3943105Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:57.3943778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:57.3944444Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:57.3955661Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:57.3956013Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:57.3956266Z E       ^
2025-05-07T20:32:57.3956779Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:57.3957238Z 
2025-05-07T20:32:57.3957672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:57.3958179Z 
2025-05-07T20:32:57.3958280Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:32:57.3972182Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:57.3972438Z moe/activation_test.py:117:
2025-05-07T20:32:57.3973047Z moe/activation_test.py:115: in fn
2025-05-07T20:32:57.3973320Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:57.3974020Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:57.3974735Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:57.3985754Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:57.3986103Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:57.3986362Z E       ^
2025-05-07T20:32:57.3986817Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:57.3987278Z 
2025-05-07T20:32:57.3987755Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
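For reference, the dtype names in the ValueError map onto torch dtypes roughly as follows (a sketch; the availability notes combine this log's error message with Triton's SM 8.9 requirement for fp8e4nv):

    import torch

    # Triton fp8 name -> torch dtype (None = no torch equivalent).
    TRITON_TO_TORCH_FP8 = {
        "fp8e4nv": torch.float8_e4m3fn,  # requires SM 8.9+ (Ada/Hopper)
        "fp8e5": torch.float8_e5m2,      # listed as supported on this SM 8.6 GPU
        "fp8e4b15": None,                # bias-15 e4m3 variant, Triton-internal
    }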
2025-05-07T20:32:57.5541031Z 
2025-05-07T20:32:57.5541230Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:57.5556398Z >       y_fp8, y_scale = fn()
2025-05-07T20:32:57.5556658Z moe/activation_test.py:117:
2025-05-07T20:32:57.5557292Z moe/activation_test.py:115: in fn
2025-05-07T20:32:57.5557686Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:32:57.5558254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:32:57.5558811Z     return fn(*args, **kwargs)
2025-05-07T20:32:57.5559467Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:32:57.5560137Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:32:57.5571282Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:32:57.5571637Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:32:57.5571888Z E       ^
2025-05-07T20:32:57.5572353Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:57.5572818Z 
2025-05-07T20:32:57.5573246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:57.5572818Z 2025-05-07T20:32:57.5573246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:57.5573860Z Trying example: test_silu_mul_quant( self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True ) [identical test source and traceback elided: fn() raises the same CompilationError while compiling _fbgemm_silu_mul_quant]
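Every one of these compiles dies for the same environmental reason: Triton's fp8e4nv is the FP8 E4M3 encoding (exposed in PyTorch as torch.float8_e4m3fn), and Triton generally only emits it for NVIDIA GPUs of compute capability 8.9 or newer; older parts are offered just the fp8e4b15/fp8e5 encodings that the ValueError lists. A minimal probe, as a sketch assuming a CUDA-enabled PyTorch build (the 8.9 threshold is our reading of Triton's support matrix, not something this log prints):

    # Capability probe (sketch; assumes a CUDA build of PyTorch).
    # Triton's fp8e4nv (E4M3, torch.float8_e4m3fn) codegen is assumed to be
    # gated on compute capability >= 8.9 (Ada/Hopper); anything older only
    # gets the fp8e4b15/fp8e5 encodings named in the ValueError above.
    import torch

    major, minor = torch.cuda.get_device_capability()
    print(f"sm_{major}{minor}: fp8e4nv usable = {(major, minor) >= (8, 9)}")

On this runner the predicate would come out False, which is why both the torch.compile path and the eager path fail inside make_ir before any kernel ever executes.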
2025-05-07T20:32:57.5604653Z Trying example: test_silu_mul_quant( self=, T=1, D=7168, scale_ub=None, contiguous=False, compiled=True ) [test source elided] — the one example in this stretch with a different failure shape: fn() returns, the dequantize y = y_fp8.to(torch.float32) * y_scale[:, None] completes, and the test then dies on the reference path at moe/activation_test.py:126 (> y_fp8_ref, y_scale_ref = ref_fn()). ref_fn (activation_test.py:124) calls triton_quantize_fp8_row (/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370), which launches _kernel_quantize_fp8_row[grid]; the autotuner's timing pass (triton/runtime/autotuner.py:186 -> _bench:166 -> kernel_call:152, via triton/testing.py:117 do_bench) reaches jit.py:623 -> compiler.py:273, and make_ir raises the identical error at triton/compiler/compiler.py:100 — triton.compiler.errors.CompilationError: at 1:0: def _kernel_quantize_fp8_row( ^ ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
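For reference, the math this test checks is small enough to restate. The following is a plain-PyTorch sketch of ref_fn's two steps — SiLU(x0) * x1 in fp32, then row-wise FP8 quantization with an optional cap on the per-row scale input; the helper name, the 448.0 E4M3 max, and the epsilon clamp are our illustrative assumptions, while the test itself uses FBGEMM's triton_quantize_fp8_row:

    import torch
    from typing import Optional, Tuple

    E4M3_MAX = 448.0  # finite max of torch.float8_e4m3fn

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # SiLU-gate in fp32, exactly as ref_fn does in the test body.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        # Row-wise dynamic scale: per-row absmax, optionally capped by scale_ub.
        row_max = y.abs().amax(dim=-1)
        if scale_ub is not None:
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max.clamp(min=1e-12) / E4M3_MAX
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

Dequantizing with y_fp8.to(torch.float32) * scale[:, None] recovers y to within E4M3 rounding, which is the comparison the test performs right after fn().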
[the next eight examples each re-print the same test source and fail with the identical CompilationError — ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") — while compiling _fbgemm_silu_mul_quant; compiled=True runs enter via torch/_dynamo/eval_frame.py:678, compiled=False runs call the kernel directly; full tracebacks elided]
2025-05-07T20:32:57.7789766Z Trying example: test_silu_mul_quant( self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True )
2025-05-07T20:32:57.9224549Z Trying example: test_silu_mul_quant( self=, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False )
2025-05-07T20:32:57.9256633Z Trying example: test_silu_mul_quant( self=, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True )
2025-05-07T20:32:57.9287873Z Trying example: test_silu_mul_quant( self=, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True )
2025-05-07T20:32:58.1197712Z Trying example: test_silu_mul_quant( self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False )
2025-05-07T20:32:58.1227763Z Trying example: test_silu_mul_quant( self=, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False )
2025-05-07T20:32:58.2848148Z Trying example: test_silu_mul_quant( self=, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True )
2025-05-07T20:32:58.2880681Z Trying example: test_silu_mul_quant( self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True )
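One further example (below) ends this stretch of the log the same way. Since every drawn case fails for the same hardware reason, the test-side remedy is to gate the whole test on FP8 support rather than letting Hypothesis replay the failure across its sample grid. A pytest-style sketch (the helper name and the sm_89 threshold are our assumptions; the suite may already have a shared skip utility for this):

    import pytest
    import torch

    def cuda_supports_fp8e4nv() -> bool:
        # Hypothetical helper: assume fp8e4nv needs compute capability >= 8.9.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @pytest.mark.skipif(
        not cuda_supports_fp8e4nv(),
        reason="Triton fp8e4nv (torch.float8_e4m3fn) requires an sm_89+ GPU",
    )
    def test_silu_mul_quant_guarded() -> None:
        ...  # same body as test_silu_mul_quant above

Skipping at collection time also keeps the autotuner from repeatedly benchmarking kernels that can never compile on this architecture.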
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:32:58.2921531Z
2025-05-07T20:32:58.2921946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:32:58.4316680Z
Hypothesis then tried the following examples; each one reran the identical test body and failed at the same point (moe/activation_test.py:117 -> silu_mul_quant -> _fbgemm_silu_mul_quant[grid] -> triton compile) with the identical CompilationError, so only the sampled parameters are listed:
2025-05-07T20:32:58.4317011Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:58.4349097Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:58.4380626Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:58.6417463Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:32:58.6448408Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:32:58.7802276Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:32:58.7834026Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:58.7866365Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:59.0167883Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:32:59.0206516Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:32:59.1791081Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:59.1823499Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:32:59.1840832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:59.1841519Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:59.1842484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:59.1843171Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:59.1843850Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:59.1844380Z kernel = self.compile( 2025-05-07T20:32:59.1844927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:59.1845565Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:59.1845947Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.1846167Z 2025-05-07T20:32:59.1846369Z self = 2025-05-07T20:32:59.1847426Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:59.1848822Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d78f07c0>} 2025-05-07T20:32:59.1850185Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:59.1851179Z context = 2025-05-07T20:32:59.1851459Z 2025-05-07T20:32:59.1851626Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:59.1852136Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:59.1852654Z module_map=module_map) 2025-05-07T20:32:59.1853021Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:59.1853367Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:59.1853616Z E ^ 2025-05-07T20:32:59.1854076Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:59.1854534Z 2025-05-07T20:32:59.1854968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:59.5497423Z 2025-05-07T20:32:59.5498186Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:59.5498810Z self=, 2025-05-07T20:32:59.5499357Z T=2048, 2025-05-07T20:32:59.5499613Z D=5120, 2025-05-07T20:32:59.5499874Z scale_ub=1200.0, 2025-05-07T20:32:59.5500181Z contiguous=False, 2025-05-07T20:32:59.5500525Z compiled=True, 2025-05-07T20:32:59.5500781Z ) 2025-05-07T20:32:59.5501120Z self = 2025-05-07T20:32:59.5501618Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:59.5501908Z 2025-05-07T20:32:59.5501988Z @given( 2025-05-07T20:32:59.5502220Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:59.5502531Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:59.5502836Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:59.5503170Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:59.5503501Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:59.5503784Z ) 2025-05-07T20:32:59.5504126Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:59.5504583Z def test_silu_mul_quant( 2025-05-07T20:32:59.5504827Z self, 2025-05-07T20:32:59.5505022Z T: int, 2025-05-07T20:32:59.5505220Z D: int, 2025-05-07T20:32:59.5505441Z scale_ub: Optional[float], 2025-05-07T20:32:59.5506110Z contiguous: bool, 2025-05-07T20:32:59.5506356Z compiled: bool, 2025-05-07T20:32:59.5506588Z ) -> None: 2025-05-07T20:32:59.5506798Z torch.manual_seed(2025) 2025-05-07T20:32:59.5507042Z 2025-05-07T20:32:59.5507314Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:59.5507730Z 2025-05-07T20:32:59.5507924Z x_sign = torch.sign(x) 2025-05-07T20:32:59.5508217Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:59.5508518Z x = x_sign * x_clamp 2025-05-07T20:32:59.5508759Z x0 = x[:, :D] 2025-05-07T20:32:59.5508977Z x1 = x[:, D:] 2025-05-07T20:32:59.5509175Z 2025-05-07T20:32:59.5509357Z if contiguous: 2025-05-07T20:32:59.5509587Z x0 = x0.contiguous() 2025-05-07T20:32:59.5509848Z x1 = x1.contiguous() 2025-05-07T20:32:59.5510081Z 2025-05-07T20:32:59.5510272Z if scale_ub is not None: 2025-05-07T20:32:59.5510560Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:59.5510889Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:59.5511283Z ) 2025-05-07T20:32:59.5511477Z else: 2025-05-07T20:32:59.5511680Z scale_ub_tensor = None 2025-05-07T20:32:59.5511934Z 2025-05-07T20:32:59.5512166Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.5512468Z op = silu_mul_quant 2025-05-07T20:32:59.5512722Z if compiled: 2025-05-07T20:32:59.5512972Z op = torch.compile(op) 2025-05-07T20:32:59.5513258Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.5513530Z 2025-05-07T20:32:59.5513722Z > y_fp8, y_scale = fn() 2025-05-07T20:32:59.5513881Z 2025-05-07T20:32:59.5513986Z moe/activation_test.py:117: 2025-05-07T20:32:59.5514363Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.5514691Z moe/activation_test.py:115: in fn 2025-05-07T20:32:59.5514980Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.5515533Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:59.5516093Z return fn(*args, **kwargs) 
2025-05-07T20:32:59.5516760Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:59.5517461Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:59.5517996Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:59.5518669Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:59.5519324Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:59.5519841Z kernel = self.compile( 2025-05-07T20:32:59.5520392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:59.5521040Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:59.5521430Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.5521654Z 2025-05-07T20:32:59.5521856Z self = 2025-05-07T20:32:59.5522916Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:59.5524300Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d78f1580>} 2025-05-07T20:32:59.5525753Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:59.5526757Z context = 2025-05-07T20:32:59.5527047Z 2025-05-07T20:32:59.5527211Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:59.5527728Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:59.5528197Z module_map=module_map) 2025-05-07T20:32:59.5528560Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:59.5528913Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:59.5537308Z E ^ 2025-05-07T20:32:59.5537825Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:59.5538280Z 2025-05-07T20:32:59.5538713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:59.5539225Z 2025-05-07T20:32:59.5539333Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:59.5539832Z self=, 2025-05-07T20:32:59.5540534Z T=4096, 2025-05-07T20:32:59.5540726Z D=5120, 2025-05-07T20:32:59.5540929Z scale_ub=1200.0, 2025-05-07T20:32:59.5541163Z contiguous=True, 2025-05-07T20:32:59.5541382Z compiled=True, 2025-05-07T20:32:59.5541597Z ) 2025-05-07T20:32:59.5541921Z self = 2025-05-07T20:32:59.5542415Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:59.5542680Z 2025-05-07T20:32:59.5542761Z @given( 2025-05-07T20:32:59.5542998Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:59.5543407Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:59.5543710Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:59.5544051Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:59.5544387Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:59.5544667Z ) 2025-05-07T20:32:59.5545015Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:59.5545466Z def test_silu_mul_quant( 2025-05-07T20:32:59.5545715Z self, 2025-05-07T20:32:59.5545907Z T: int, 2025-05-07T20:32:59.5546109Z D: int, 2025-05-07T20:32:59.5546332Z scale_ub: Optional[float], 2025-05-07T20:32:59.5546602Z contiguous: bool, 2025-05-07T20:32:59.5546845Z compiled: bool, 2025-05-07T20:32:59.5547073Z ) -> None: 2025-05-07T20:32:59.5547285Z torch.manual_seed(2025) 2025-05-07T20:32:59.5547620Z 2025-05-07T20:32:59.5547899Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:59.5548247Z 2025-05-07T20:32:59.5548445Z x_sign = torch.sign(x) 2025-05-07T20:32:59.5548746Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:59.5549049Z x = x_sign * x_clamp 2025-05-07T20:32:59.5549299Z x0 = x[:, :D] 2025-05-07T20:32:59.5549520Z x1 = x[:, D:] 2025-05-07T20:32:59.5549729Z 2025-05-07T20:32:59.5549921Z if contiguous: 2025-05-07T20:32:59.5550158Z x0 = x0.contiguous() 2025-05-07T20:32:59.5550410Z x1 = x1.contiguous() 2025-05-07T20:32:59.5550658Z 2025-05-07T20:32:59.5550853Z if scale_ub is not None: 2025-05-07T20:32:59.5551122Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:59.5551465Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:59.5551778Z ) 2025-05-07T20:32:59.5551974Z else: 2025-05-07T20:32:59.5552180Z scale_ub_tensor = None 2025-05-07T20:32:59.5552433Z 2025-05-07T20:32:59.5552679Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.5552987Z op = silu_mul_quant 2025-05-07T20:32:59.5553242Z if compiled: 2025-05-07T20:32:59.5553628Z op = torch.compile(op) 2025-05-07T20:32:59.5553927Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.5554206Z 2025-05-07T20:32:59.5554404Z > y_fp8, y_scale = fn() 2025-05-07T20:32:59.5554569Z 2025-05-07T20:32:59.5554668Z moe/activation_test.py:117: 2025-05-07T20:32:59.5554968Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.5555304Z moe/activation_test.py:115: in fn 2025-05-07T20:32:59.5555590Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.5556153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:59.5556714Z return fn(*args, **kwargs) 
2025-05-07T20:32:59.5557394Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:59.5558070Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:59.5558625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:59.5559401Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:59.5560064Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:59.5560584Z kernel = self.compile( 2025-05-07T20:32:59.5561132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:59.5561783Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:59.5562172Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.5562404Z 2025-05-07T20:32:59.5562656Z self = 2025-05-07T20:32:59.5563734Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:59.5565099Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d78f2840>} 2025-05-07T20:32:59.5566428Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:59.5567429Z context = 2025-05-07T20:32:59.5567718Z 2025-05-07T20:32:59.5567884Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:59.5568408Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:59.5568876Z module_map=module_map) 2025-05-07T20:32:59.5569233Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:59.5569590Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:59.5569859Z E ^ 2025-05-07T20:32:59.5570311Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:59.5570763Z 2025-05-07T20:32:59.5571180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:59.7278459Z 2025-05-07T20:32:59.7278779Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:59.7279379Z self=, 2025-05-07T20:32:59.7279858Z T=128, 2025-05-07T20:32:59.7280057Z D=5120, 2025-05-07T20:32:59.7280272Z scale_ub=1200.0, 2025-05-07T20:32:59.7280506Z contiguous=False, 2025-05-07T20:32:59.7280739Z compiled=True, 2025-05-07T20:32:59.7280940Z ) 2025-05-07T20:32:59.7281603Z self = 2025-05-07T20:32:59.7282125Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:32:59.7282397Z 2025-05-07T20:32:59.7282489Z @given( 2025-05-07T20:32:59.7282718Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:59.7283042Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:59.7283360Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:59.7283690Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:59.7284022Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:59.7284317Z ) 2025-05-07T20:32:59.7284665Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:59.7285124Z def test_silu_mul_quant( 2025-05-07T20:32:59.7285372Z self, 2025-05-07T20:32:59.7285568Z T: int, 2025-05-07T20:32:59.7285774Z D: int, 2025-05-07T20:32:59.7286010Z scale_ub: Optional[float], 2025-05-07T20:32:59.7286374Z contiguous: bool, 2025-05-07T20:32:59.7286624Z compiled: bool, 2025-05-07T20:32:59.7286856Z ) -> None: 2025-05-07T20:32:59.7287074Z torch.manual_seed(2025) 2025-05-07T20:32:59.7287309Z 2025-05-07T20:32:59.7287585Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:59.7287930Z 2025-05-07T20:32:59.7288120Z x_sign = torch.sign(x) 2025-05-07T20:32:59.7288410Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:59.7288722Z x = x_sign * x_clamp 2025-05-07T20:32:59.7288952Z x0 = x[:, :D] 2025-05-07T20:32:59.7289171Z x1 = x[:, D:] 2025-05-07T20:32:59.7289382Z 2025-05-07T20:32:59.7289564Z if contiguous: 2025-05-07T20:32:59.7289882Z x0 = x0.contiguous() 2025-05-07T20:32:59.7290139Z x1 = x1.contiguous() 2025-05-07T20:32:59.7290378Z 2025-05-07T20:32:59.7290570Z if scale_ub is not None: 2025-05-07T20:32:59.7290852Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:59.7291184Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:59.7291499Z ) 2025-05-07T20:32:59.7291697Z else: 2025-05-07T20:32:59.7291904Z scale_ub_tensor = None 2025-05-07T20:32:59.7292155Z 2025-05-07T20:32:59.7292389Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.7292699Z op = silu_mul_quant 2025-05-07T20:32:59.7292942Z if compiled: 2025-05-07T20:32:59.7293188Z op = torch.compile(op) 2025-05-07T20:32:59.7293483Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.7293750Z 2025-05-07T20:32:59.7293940Z > y_fp8, y_scale = fn() 2025-05-07T20:32:59.7294102Z 2025-05-07T20:32:59.7294211Z moe/activation_test.py:117: 2025-05-07T20:32:59.7294496Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.7294826Z moe/activation_test.py:115: in fn 2025-05-07T20:32:59.7295107Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.7295668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:59.7296212Z return fn(*args, **kwargs) 
2025-05-07T20:32:59.7296868Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:59.7297548Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:59.7298075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:59.7298752Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:59.7299423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:59.7299947Z kernel = self.compile( 2025-05-07T20:32:59.7300571Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:59.7301222Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:59.7301617Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.7301839Z 2025-05-07T20:32:59.7302040Z self = 2025-05-07T20:32:59.7303104Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:59.7304532Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d78f34c0>} 2025-05-07T20:32:59.7305855Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:59.7306909Z context = 2025-05-07T20:32:59.7307192Z 2025-05-07T20:32:59.7307355Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:59.7307984Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:59.7308447Z module_map=module_map) 2025-05-07T20:32:59.7308805Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:59.7309145Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:59.7309400Z E ^ 2025-05-07T20:32:59.7309853Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:59.7310349Z 2025-05-07T20:32:59.7310778Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:59.7311289Z 2025-05-07T20:32:59.7311391Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:59.7311797Z self=, 2025-05-07T20:32:59.7312190Z T=16384, 2025-05-07T20:32:59.7312379Z D=7168, 2025-05-07T20:32:59.7312571Z scale_ub=1200.0, 2025-05-07T20:32:59.7312793Z contiguous=True, 2025-05-07T20:32:59.7313005Z compiled=True, 2025-05-07T20:32:59.7313215Z ) 2025-05-07T20:32:59.7313542Z self = 2025-05-07T20:32:59.7314020Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:32:59.7314306Z 2025-05-07T20:32:59.7314383Z @given( 2025-05-07T20:32:59.7314613Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:59.7314927Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:59.7315234Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:59.7315562Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:59.7315904Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:59.7316189Z ) 2025-05-07T20:32:59.7316540Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:59.7316979Z def test_silu_mul_quant( 2025-05-07T20:32:59.7317216Z self, 2025-05-07T20:32:59.7317431Z T: int, 2025-05-07T20:32:59.7317638Z D: int, 2025-05-07T20:32:59.7317854Z scale_ub: Optional[float], 2025-05-07T20:32:59.7318132Z contiguous: bool, 2025-05-07T20:32:59.7318379Z compiled: bool, 2025-05-07T20:32:59.7318604Z ) -> None: 2025-05-07T20:32:59.7318822Z torch.manual_seed(2025) 2025-05-07T20:32:59.7319073Z 2025-05-07T20:32:59.7319346Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:59.7319692Z 2025-05-07T20:32:59.7319893Z x_sign = torch.sign(x) 2025-05-07T20:32:59.7320311Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:59.7320619Z x = x_sign * x_clamp 2025-05-07T20:32:59.7320866Z x0 = x[:, :D] 2025-05-07T20:32:59.7321089Z x1 = x[:, D:] 2025-05-07T20:32:59.7321297Z 2025-05-07T20:32:59.7321492Z if contiguous: 2025-05-07T20:32:59.7321725Z x0 = x0.contiguous() 2025-05-07T20:32:59.7321979Z x1 = x1.contiguous() 2025-05-07T20:32:59.7322220Z 2025-05-07T20:32:59.7322417Z if scale_ub is not None: 2025-05-07T20:32:59.7322683Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:59.7323022Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:59.7323331Z ) 2025-05-07T20:32:59.7323515Z else: 2025-05-07T20:32:59.7323733Z scale_ub_tensor = None 2025-05-07T20:32:59.7323987Z 2025-05-07T20:32:59.7324214Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.7324538Z op = silu_mul_quant 2025-05-07T20:32:59.7324795Z if compiled: 2025-05-07T20:32:59.7325101Z op = torch.compile(op) 2025-05-07T20:32:59.7325392Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.7325672Z 2025-05-07T20:32:59.7325869Z > y_fp8, y_scale = fn() 2025-05-07T20:32:59.7326033Z 2025-05-07T20:32:59.7326135Z moe/activation_test.py:117: 2025-05-07T20:32:59.7326437Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.7326771Z moe/activation_test.py:115: in fn 2025-05-07T20:32:59.7327050Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.7327616Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:32:59.7328187Z return fn(*args, **kwargs) 
2025-05-07T20:32:59.7328898Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:59.7329604Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:59.7330150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:59.7330834Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:59.7331489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:59.7332022Z kernel = self.compile( 2025-05-07T20:32:59.7332572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:59.7333229Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:59.7333629Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.7333868Z 2025-05-07T20:32:59.7334078Z self = 2025-05-07T20:32:59.7335153Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:59.7336566Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d76fcc20>} 2025-05-07T20:32:59.7337880Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:59.7338898Z context = 2025-05-07T20:32:59.7339193Z 2025-05-07T20:32:59.7339365Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:59.7339973Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:59.7340692Z module_map=module_map) 2025-05-07T20:32:59.7341065Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:59.7341427Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:59.7341697Z E ^ 2025-05-07T20:32:59.7342158Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:59.7342614Z 2025-05-07T20:32:59.7343053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:59.8511897Z 2025-05-07T20:32:59.8512123Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:59.8512540Z self=, 2025-05-07T20:32:59.8513013Z T=16384, 2025-05-07T20:32:59.8513288Z D=5120, 2025-05-07T20:32:59.8513557Z scale_ub=1200.0, 2025-05-07T20:32:59.8513859Z contiguous=True, 2025-05-07T20:32:59.8514170Z compiled=False, 2025-05-07T20:32:59.8514454Z ) 2025-05-07T20:32:59.8514992Z self = 2025-05-07T20:32:59.8515496Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:32:59.8515769Z 2025-05-07T20:32:59.8515859Z @given( 2025-05-07T20:32:59.8516086Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:59.8516408Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:59.8516721Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:59.8517056Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:59.8517379Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:59.8517669Z ) 2025-05-07T20:32:59.8518021Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:59.8518544Z def test_silu_mul_quant( 2025-05-07T20:32:59.8518796Z self, 2025-05-07T20:32:59.8518998Z T: int, 2025-05-07T20:32:59.8519200Z D: int, 2025-05-07T20:32:59.8519428Z scale_ub: Optional[float], 2025-05-07T20:32:59.8519704Z contiguous: bool, 2025-05-07T20:32:59.8519944Z compiled: bool, 2025-05-07T20:32:59.8520184Z ) -> None: 2025-05-07T20:32:59.8520409Z torch.manual_seed(2025) 2025-05-07T20:32:59.8520650Z 2025-05-07T20:32:59.8520928Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:59.8521271Z 2025-05-07T20:32:59.8521464Z x_sign = torch.sign(x) 2025-05-07T20:32:59.8521764Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:59.8522077Z x = x_sign * x_clamp 2025-05-07T20:32:59.8522326Z x0 = x[:, :D] 2025-05-07T20:32:59.8522542Z x1 = x[:, D:] 2025-05-07T20:32:59.8522763Z 2025-05-07T20:32:59.8522959Z if contiguous: 2025-05-07T20:32:59.8523192Z x0 = x0.contiguous() 2025-05-07T20:32:59.8523459Z x1 = x1.contiguous() 2025-05-07T20:32:59.8523710Z 2025-05-07T20:32:59.8523901Z if scale_ub is not None: 2025-05-07T20:32:59.8524184Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:59.8524527Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:59.8524829Z ) 2025-05-07T20:32:59.8525032Z else: 2025-05-07T20:32:59.8525251Z scale_ub_tensor = None 2025-05-07T20:32:59.8525504Z 2025-05-07T20:32:59.8525744Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.8526062Z op = silu_mul_quant 2025-05-07T20:32:59.8526310Z if compiled: 2025-05-07T20:32:59.8526565Z op = torch.compile(op) 2025-05-07T20:32:59.8526864Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.8527142Z 2025-05-07T20:32:59.8527338Z > y_fp8, y_scale = fn() 2025-05-07T20:32:59.8527513Z 2025-05-07T20:32:59.8527616Z moe/activation_test.py:117: 2025-05-07T20:32:59.8528077Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.8528419Z moe/activation_test.py:115: in fn 2025-05-07T20:32:59.8528717Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.8529412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:32:59.8530094Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:59.8530642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:59.8531331Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:59.8531994Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:59.8532518Z kernel = self.compile( 2025-05-07T20:32:59.8533082Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:59.8533747Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:59.8534191Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.8534416Z 2025-05-07T20:32:59.8534625Z self = 2025-05-07T20:32:59.8535713Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:59.8537078Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d76fd580>} 2025-05-07T20:32:59.8538413Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:59.8539461Z context = 2025-05-07T20:32:59.8539758Z 2025-05-07T20:32:59.8539928Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:59.8540715Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:59.8541189Z module_map=module_map) 2025-05-07T20:32:59.8541550Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:59.8541907Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:59.8542179Z E ^ 2025-05-07T20:32:59.8542676Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:59.8543120Z 2025-05-07T20:32:59.8543546Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:59.8544058Z 2025-05-07T20:32:59.8544170Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:59.8544592Z self=, 2025-05-07T20:32:59.8545004Z T=1, 2025-05-07T20:32:59.8545191Z D=7168, 2025-05-07T20:32:59.8545394Z scale_ub=1200.0, 2025-05-07T20:32:59.8545628Z contiguous=False, 2025-05-07T20:32:59.8545853Z compiled=False, 2025-05-07T20:32:59.8546067Z ) 2025-05-07T20:32:59.8546388Z self = 2025-05-07T20:32:59.8546870Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:32:59.8547151Z 2025-05-07T20:32:59.8547233Z @given( 2025-05-07T20:32:59.8547579Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:32:59.8547903Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:32:59.8548212Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:32:59.8548547Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:32:59.8549010Z compiled=st.sampled_from([True, False]), 2025-05-07T20:32:59.8549291Z ) 2025-05-07T20:32:59.8549645Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:32:59.8550090Z def test_silu_mul_quant( 2025-05-07T20:32:59.8550330Z self, 2025-05-07T20:32:59.8550529Z T: int, 2025-05-07T20:32:59.8550733Z D: int, 2025-05-07T20:32:59.8550949Z scale_ub: Optional[float], 2025-05-07T20:32:59.8551229Z contiguous: bool, 2025-05-07T20:32:59.8551472Z compiled: bool, 2025-05-07T20:32:59.8551692Z ) -> None: 2025-05-07T20:32:59.8551917Z torch.manual_seed(2025) 2025-05-07T20:32:59.8552163Z 2025-05-07T20:32:59.8552441Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:32:59.8552786Z 2025-05-07T20:32:59.8552988Z x_sign = torch.sign(x) 2025-05-07T20:32:59.8553284Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:32:59.8553597Z x = x_sign * x_clamp 2025-05-07T20:32:59.8553838Z x0 = x[:, :D] 2025-05-07T20:32:59.8554132Z x1 = x[:, D:] 2025-05-07T20:32:59.8554338Z 2025-05-07T20:32:59.8554532Z if contiguous: 2025-05-07T20:32:59.8554769Z x0 = x0.contiguous() 2025-05-07T20:32:59.8555021Z x1 = x1.contiguous() 2025-05-07T20:32:59.8555266Z 2025-05-07T20:32:59.8555467Z if scale_ub is not None: 2025-05-07T20:32:59.8555742Z scale_ub_tensor = torch.tensor( 2025-05-07T20:32:59.8556086Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:32:59.8564345Z ) 2025-05-07T20:32:59.8564565Z else: 2025-05-07T20:32:59.8564781Z scale_ub_tensor = None 2025-05-07T20:32:59.8565034Z 2025-05-07T20:32:59.8565263Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:32:59.8565700Z op = silu_mul_quant 2025-05-07T20:32:59.8565949Z if compiled: 2025-05-07T20:32:59.8566202Z op = torch.compile(op) 2025-05-07T20:32:59.8566494Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.8566777Z 2025-05-07T20:32:59.8566974Z > y_fp8, y_scale = fn() 2025-05-07T20:32:59.8567139Z 2025-05-07T20:32:59.8567238Z moe/activation_test.py:117: 2025-05-07T20:32:59.8567531Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.8567860Z moe/activation_test.py:115: in fn 2025-05-07T20:32:59.8568132Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:32:59.8568822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:32:59.8569500Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:32:59.8570033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:32:59.8570723Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:32:59.8571384Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:32:59.8571910Z kernel = self.compile( 2025-05-07T20:32:59.8572456Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:32:59.8573095Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:32:59.8573492Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:32:59.8573718Z 2025-05-07T20:32:59.8573934Z self = 2025-05-07T20:32:59.8574990Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:32:59.8576435Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d76fe8e0>} 2025-05-07T20:32:59.8577759Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:32:59.8578772Z context = 2025-05-07T20:32:59.8579051Z 2025-05-07T20:32:59.8579221Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:32:59.8579731Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:32:59.8580217Z module_map=module_map) 2025-05-07T20:32:59.8580575Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:32:59.8580919Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:32:59.8581181Z E ^ 2025-05-07T20:32:59.8581644Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:32:59.8582141Z 2025-05-07T20:32:59.8582557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:32:59.8583056Z 2025-05-07T20:32:59.8583157Z Trying example: test_silu_mul_quant( 2025-05-07T20:32:59.8583562Z self=, 2025-05-07T20:32:59.8583958Z T=4096, 2025-05-07T20:32:59.8584147Z D=7168, 2025-05-07T20:32:59.8584334Z scale_ub=1200.0, 2025-05-07T20:32:59.8584561Z contiguous=False, 2025-05-07T20:32:59.8584793Z compiled=True, 2025-05-07T20:33:00.0203768Z ) 2025-05-07T20:33:00.0204710Z self = 2025-05-07T20:33:00.0206063Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:00.0206617Z 2025-05-07T20:33:00.0206790Z @given( 2025-05-07T20:33:00.0207240Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.0207854Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.0208454Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.0209091Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.0209729Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.0210288Z ) 2025-05-07T20:33:00.0210963Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.0211509Z def test_silu_mul_quant( 2025-05-07T20:33:00.0211753Z self, 2025-05-07T20:33:00.0211941Z T: int, 2025-05-07T20:33:00.0212139Z D: int, 2025-05-07T20:33:00.0212362Z scale_ub: Optional[float], 2025-05-07T20:33:00.0212627Z contiguous: bool, 2025-05-07T20:33:00.0212871Z compiled: bool, 2025-05-07T20:33:00.0213104Z ) -> None: 2025-05-07T20:33:00.0213314Z torch.manual_seed(2025) 2025-05-07T20:33:00.0213560Z 2025-05-07T20:33:00.0213832Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.0214173Z 2025-05-07T20:33:00.0214364Z x_sign = torch.sign(x) 2025-05-07T20:33:00.0214651Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.0214962Z x = x_sign * x_clamp 2025-05-07T20:33:00.0215199Z x0 = x[:, :D] 2025-05-07T20:33:00.0215423Z x1 = x[:, D:] 2025-05-07T20:33:00.0215640Z 2025-05-07T20:33:00.0215820Z if contiguous: 2025-05-07T20:33:00.0216049Z x0 = x0.contiguous() 2025-05-07T20:33:00.0216304Z x1 = x1.contiguous() 2025-05-07T20:33:00.0216542Z 2025-05-07T20:33:00.0216732Z if scale_ub is not None: 2025-05-07T20:33:00.0217009Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.0217339Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.0217647Z ) 2025-05-07T20:33:00.0217842Z else: 2025-05-07T20:33:00.0218223Z scale_ub_tensor = None 2025-05-07T20:33:00.0218482Z 2025-05-07T20:33:00.0218714Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.0219019Z op = silu_mul_quant 2025-05-07T20:33:00.0219273Z if compiled: 2025-05-07T20:33:00.0219524Z op = torch.compile(op) 2025-05-07T20:33:00.0219820Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.0220089Z 2025-05-07T20:33:00.0220283Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.0220444Z 2025-05-07T20:33:00.0220550Z moe/activation_test.py:117: 2025-05-07T20:33:00.0220838Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.0221169Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.0221454Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.0222008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:00.0222570Z return fn(*args, **kwargs) 
2025-05-07T20:33:00.0223304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.0223982Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.0224510Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.0225191Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.0225848Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.0226368Z kernel = self.compile( 2025-05-07T20:33:00.0226924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.0227732Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.0228127Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.0228352Z 2025-05-07T20:33:00.0228555Z self = 2025-05-07T20:33:00.0229620Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.0230989Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d76ffa60>} 2025-05-07T20:33:00.0232365Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.0233373Z context = 2025-05-07T20:33:00.0233658Z 2025-05-07T20:33:00.0233822Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.0234352Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.0234815Z module_map=module_map) 2025-05-07T20:33:00.0235175Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.0235526Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.0235786Z E ^ 2025-05-07T20:33:00.0236247Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.0236697Z 2025-05-07T20:33:00.0237120Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.0237633Z 2025-05-07T20:33:00.0237736Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.0238223Z self=, 2025-05-07T20:33:00.0238629Z T=128, 2025-05-07T20:33:00.0238816Z D=7168, 2025-05-07T20:33:00.0239037Z scale_ub=1200.0, 2025-05-07T20:33:00.0239269Z contiguous=False, 2025-05-07T20:33:00.0239502Z compiled=True, 2025-05-07T20:33:00.0239706Z ) 2025-05-07T20:33:00.0240022Z self = 2025-05-07T20:33:00.0240787Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:00.0241055Z 2025-05-07T20:33:00.0241135Z @given( 2025-05-07T20:33:00.0241371Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.0241684Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.0241984Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.0242311Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.0242643Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.0242926Z ) 2025-05-07T20:33:00.0243277Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.0243789Z def test_silu_mul_quant( 2025-05-07T20:33:00.0244029Z self, 2025-05-07T20:33:00.0244223Z T: int, 2025-05-07T20:33:00.0244423Z D: int, 2025-05-07T20:33:00.0244647Z scale_ub: Optional[float], 2025-05-07T20:33:00.0244913Z contiguous: bool, 2025-05-07T20:33:00.0245153Z compiled: bool, 2025-05-07T20:33:00.0245377Z ) -> None: 2025-05-07T20:33:00.0245588Z torch.manual_seed(2025) 2025-05-07T20:33:00.0245832Z 2025-05-07T20:33:00.0246101Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.0246429Z 2025-05-07T20:33:00.0246618Z x_sign = torch.sign(x) 2025-05-07T20:33:00.0246908Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.0247284Z x = x_sign * x_clamp 2025-05-07T20:33:00.0247527Z x0 = x[:, :D] 2025-05-07T20:33:00.0247749Z x1 = x[:, D:] 2025-05-07T20:33:00.0247959Z 2025-05-07T20:33:00.0248152Z if contiguous: 2025-05-07T20:33:00.0248388Z x0 = x0.contiguous() 2025-05-07T20:33:00.0248649Z x1 = x1.contiguous() 2025-05-07T20:33:00.0248884Z 2025-05-07T20:33:00.0249074Z if scale_ub is not None: 2025-05-07T20:33:00.0249343Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.0249674Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.0249986Z ) 2025-05-07T20:33:00.0250180Z else: 2025-05-07T20:33:00.0250387Z scale_ub_tensor = None 2025-05-07T20:33:00.0250639Z 2025-05-07T20:33:00.0250874Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.0251185Z op = silu_mul_quant 2025-05-07T20:33:00.0251450Z if compiled: 2025-05-07T20:33:00.0251709Z op = torch.compile(op) 2025-05-07T20:33:00.0252000Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.0252280Z 2025-05-07T20:33:00.0252481Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.0252668Z 2025-05-07T20:33:00.0252767Z moe/activation_test.py:117: 2025-05-07T20:33:00.0253065Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.0253388Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.0253673Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.0254236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:00.0254792Z return fn(*args, **kwargs) 
2025-05-07T20:33:00.0255439Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.0256124Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.0256671Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.0257510Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.0258185Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.0258759Z kernel = self.compile( 2025-05-07T20:33:00.0259319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.0259974Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.0260384Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.0260611Z 2025-05-07T20:33:00.0260837Z self = 2025-05-07T20:33:00.0261907Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.0263261Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d74d4ea0>} 2025-05-07T20:33:00.0264677Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.0265689Z context = 2025-05-07T20:33:00.0265970Z 2025-05-07T20:33:00.0266148Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.0266662Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.0267131Z module_map=module_map) 2025-05-07T20:33:00.0267634Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.0267990Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.0268252Z E ^ 2025-05-07T20:33:00.0268723Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.0269174Z 2025-05-07T20:33:00.0269595Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.0270109Z 2025-05-07T20:33:00.0270219Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.0270621Z self=, 2025-05-07T20:33:00.0271021Z T=2048, 2025-05-07T20:33:00.0271212Z D=7168, 2025-05-07T20:33:00.0271404Z scale_ub=None, 2025-05-07T20:33:00.0271627Z contiguous=True, 2025-05-07T20:33:00.0271851Z compiled=True, 2025-05-07T20:33:00.1498503Z ) 2025-05-07T20:33:00.1499010Z self = 2025-05-07T20:33:00.1499683Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:00.1500067Z 2025-05-07T20:33:00.1500170Z @given( 2025-05-07T20:33:00.1500400Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.1500712Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.1501025Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.1501348Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.1501674Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.1501967Z ) 2025-05-07T20:33:00.1502313Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.1502757Z def test_silu_mul_quant( 2025-05-07T20:33:00.1503011Z self, 2025-05-07T20:33:00.1503209Z T: int, 2025-05-07T20:33:00.1503405Z D: int, 2025-05-07T20:33:00.1503629Z scale_ub: Optional[float], 2025-05-07T20:33:00.1503905Z contiguous: bool, 2025-05-07T20:33:00.1504147Z compiled: bool, 2025-05-07T20:33:00.1504382Z ) -> None: 2025-05-07T20:33:00.1504918Z torch.manual_seed(2025) 2025-05-07T20:33:00.1505164Z 2025-05-07T20:33:00.1505437Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.1505781Z 2025-05-07T20:33:00.1505970Z x_sign = torch.sign(x) 2025-05-07T20:33:00.1506264Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.1506578Z x = x_sign * x_clamp 2025-05-07T20:33:00.1506815Z x0 = x[:, :D] 2025-05-07T20:33:00.1507041Z x1 = x[:, D:] 2025-05-07T20:33:00.1507255Z 2025-05-07T20:33:00.1507568Z if contiguous: 2025-05-07T20:33:00.1507808Z x0 = x0.contiguous() 2025-05-07T20:33:00.1508057Z x1 = x1.contiguous() 2025-05-07T20:33:00.1508298Z 2025-05-07T20:33:00.1508492Z if scale_ub is not None: 2025-05-07T20:33:00.1508763Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.1509099Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.1509414Z ) 2025-05-07T20:33:00.1509603Z else: 2025-05-07T20:33:00.1509896Z scale_ub_tensor = None 2025-05-07T20:33:00.1510150Z 2025-05-07T20:33:00.1510373Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.1510685Z op = silu_mul_quant 2025-05-07T20:33:00.1510938Z if compiled: 2025-05-07T20:33:00.1511186Z op = torch.compile(op) 2025-05-07T20:33:00.1511480Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.1511751Z 2025-05-07T20:33:00.1511946Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.1512108Z 2025-05-07T20:33:00.1512211Z moe/activation_test.py:117: 2025-05-07T20:33:00.1512503Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.1512833Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.1513194Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.1513759Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:00.1514321Z return fn(*args, **kwargs) 
2025-05-07T20:33:00.1515021Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.1515695Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.1516232Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.1516904Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.1517558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.1518091Z kernel = self.compile( 2025-05-07T20:33:00.1518648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.1519319Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.1519722Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.1519959Z 2025-05-07T20:33:00.1520166Z self = 2025-05-07T20:33:00.1521264Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.1522635Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d74d5c60>} 2025-05-07T20:33:00.1523951Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.1525052Z context = 2025-05-07T20:33:00.1525342Z 2025-05-07T20:33:00.1525505Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.1526029Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.1526488Z module_map=module_map) 2025-05-07T20:33:00.1526862Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.1527216Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.1527483Z E ^ 2025-05-07T20:33:00.1527937Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.1528390Z 2025-05-07T20:33:00.1528811Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.1529315Z 2025-05-07T20:33:00.1529422Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.1529833Z self=, 2025-05-07T20:33:00.1530282Z T=16384, 2025-05-07T20:33:00.1530475Z D=5120, 2025-05-07T20:33:00.1530670Z scale_ub=None, 2025-05-07T20:33:00.1530878Z contiguous=False, 2025-05-07T20:33:00.1531100Z compiled=False, 2025-05-07T20:33:00.1531307Z ) 2025-05-07T20:33:00.1531617Z self = 2025-05-07T20:33:00.1532108Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:00.1532386Z 2025-05-07T20:33:00.1532472Z @given( 2025-05-07T20:33:00.1532691Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.1533003Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.1533306Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.1533676Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.1533993Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.1534282Z ) 2025-05-07T20:33:00.1534629Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.1535066Z def test_silu_mul_quant( 2025-05-07T20:33:00.1535306Z self, 2025-05-07T20:33:00.1535496Z T: int, 2025-05-07T20:33:00.1535682Z D: int, 2025-05-07T20:33:00.1535903Z scale_ub: Optional[float], 2025-05-07T20:33:00.1536176Z contiguous: bool, 2025-05-07T20:33:00.1536406Z compiled: bool, 2025-05-07T20:33:00.1536627Z ) -> None: 2025-05-07T20:33:00.1536847Z torch.manual_seed(2025) 2025-05-07T20:33:00.1537078Z 2025-05-07T20:33:00.1537344Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.1537688Z 2025-05-07T20:33:00.1537877Z x_sign = torch.sign(x) 2025-05-07T20:33:00.1538172Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.1540600Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
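Note on the CompilationError above: fp8e4nv is Triton's float8 e4m3 type, which requires compute capability 8.9 (Ada) or newer. The A10G in a linux.g5.4xlarge runner is sm_86, where Triton only exposes fp8e4b15 and fp8e5, so every compiled example of this test fails with the same error. A minimal guard (a sketch; the helper and decorator names are illustrative, not part of the test file):

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv (e4m3) needs compute capability >= (8, 9).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Applied to test_silu_mul_quant (or its TestCase), this turns the
    # repeated CompilationError on sm_86 runners into a skip:
    skip_unless_fp8e4nv = unittest.skipUnless(
        supports_fp8e4nv(), "Triton fp8e4nv requires sm_89+ (Ada/Hopper)"
    )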
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.1542464Z 2025-05-07T20:33:00.1542583Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:00.1542798Z 2025-05-07T20:33:00.1542900Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.1543317Z self=, 2025-05-07T20:33:00.1543710Z T=4096, 2025-05-07T20:33:00.1543898Z D=7168, 2025-05-07T20:33:00.1544093Z scale_ub=1200.0, 2025-05-07T20:33:00.1544316Z contiguous=True, 2025-05-07T20:33:00.1544679Z compiled=True, 2025-05-07T20:33:00.1544887Z ) 2025-05-07T20:33:00.1545200Z self = 2025-05-07T20:33:00.1545687Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:00.1545967Z 2025-05-07T20:33:00.1546047Z @given( 2025-05-07T20:33:00.1546271Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.1546576Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.1546877Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.1547200Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.1547569Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.1547846Z ) 2025-05-07T20:33:00.1548188Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.1548618Z def test_silu_mul_quant( 2025-05-07T20:33:00.1548857Z self, 2025-05-07T20:33:00.1549046Z T: int, 2025-05-07T20:33:00.1549239Z D: int, 2025-05-07T20:33:00.1549521Z scale_ub: Optional[float], 2025-05-07T20:33:00.1549790Z contiguous: bool, 2025-05-07T20:33:00.1550025Z compiled: bool, 2025-05-07T20:33:00.1550237Z ) -> None: 2025-05-07T20:33:00.1550448Z torch.manual_seed(2025) 2025-05-07T20:33:00.1550691Z 2025-05-07T20:33:00.1550951Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.1551299Z 2025-05-07T20:33:00.1551487Z x_sign = torch.sign(x) 2025-05-07T20:33:00.1551769Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.1553746Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.1555703Z 2025-05-07T20:33:00.1555819Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:00.1556032Z 2025-05-07T20:33:00.1556136Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.1556546Z self=, 2025-05-07T20:33:00.1556955Z T=16384, 2025-05-07T20:33:00.1557143Z D=7168, 2025-05-07T20:33:00.1557328Z scale_ub=None, 2025-05-07T20:33:00.1557537Z contiguous=False, 2025-05-07T20:33:00.1557759Z compiled=False, 2025-05-07T20:33:00.1557958Z ) 2025-05-07T20:33:00.1558262Z self = 2025-05-07T20:33:00.1559100Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:00.1559409Z 2025-05-07T20:33:00.1559522Z @given( 2025-05-07T20:33:00.1568159Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.1568516Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.1568826Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.1569154Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.1569471Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.1569754Z ) 2025-05-07T20:33:00.1570111Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.1570555Z def test_silu_mul_quant( 2025-05-07T20:33:00.1570799Z self, 2025-05-07T20:33:00.1570993Z T: int, 2025-05-07T20:33:00.1571184Z D: int, 2025-05-07T20:33:00.1571401Z scale_ub: Optional[float], 2025-05-07T20:33:00.1571678Z contiguous: bool, 2025-05-07T20:33:00.1571919Z compiled: bool, 2025-05-07T20:33:00.1572141Z ) -> None: 2025-05-07T20:33:00.1572483Z torch.manual_seed(2025) 2025-05-07T20:33:00.1572732Z 2025-05-07T20:33:00.1573005Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.1575087Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.1576947Z 2025-05-07T20:33:00.1577069Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.2811463Z 2025-05-07T20:33:00.2811804Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.2812472Z self=, 2025-05-07T20:33:00.2813387Z T=2048, 2025-05-07T20:33:00.2813630Z D=7168, 2025-05-07T20:33:00.2813820Z scale_ub=1200.0, 2025-05-07T20:33:00.2814044Z contiguous=True, 2025-05-07T20:33:00.2814270Z compiled=True, 2025-05-07T20:33:00.2814475Z ) 2025-05-07T20:33:00.2814794Z self = 2025-05-07T20:33:00.2815288Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:00.2815556Z 2025-05-07T20:33:00.2815638Z @given( 2025-05-07T20:33:00.2815868Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.2816183Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.2816487Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.2816920Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.2817250Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.2817539Z ) 2025-05-07T20:33:00.2817906Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.2818370Z def test_silu_mul_quant( 2025-05-07T20:33:00.2818620Z self, 2025-05-07T20:33:00.2818808Z T: int, 2025-05-07T20:33:00.2819014Z D: int, 2025-05-07T20:33:00.2819247Z scale_ub: Optional[float], 2025-05-07T20:33:00.2819520Z contiguous: bool, 2025-05-07T20:33:00.2819763Z compiled: bool, 2025-05-07T20:33:00.2819993Z ) -> None: 2025-05-07T20:33:00.2820205Z torch.manual_seed(2025) 2025-05-07T20:33:00.2820451Z 2025-05-07T20:33:00.2820724Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.2821063Z 2025-05-07T20:33:00.2821257Z x_sign = torch.sign(x) 2025-05-07T20:33:00.2821555Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.2823569Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.2825507Z 2025-05-07T20:33:00.2825630Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:00.2825843Z 2025-05-07T20:33:00.2825945Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.2826364Z self=, 2025-05-07T20:33:00.2826780Z T=2048, 2025-05-07T20:33:00.2826975Z D=7168, 2025-05-07T20:33:00.2827160Z scale_ub=None, 2025-05-07T20:33:00.2827373Z contiguous=True, 2025-05-07T20:33:00.2827695Z compiled=False, 2025-05-07T20:33:00.2828052Z ) 2025-05-07T20:33:00.2828373Z self = 2025-05-07T20:33:00.2828862Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:00.2829125Z 2025-05-07T20:33:00.2829204Z @given( 2025-05-07T20:33:00.2829431Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.2829746Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.2830045Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.2830375Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.2830698Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.2830981Z ) 2025-05-07T20:33:00.2831325Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.2831770Z def test_silu_mul_quant( 2025-05-07T20:33:00.2832020Z self, 2025-05-07T20:33:00.2832210Z T: int, 2025-05-07T20:33:00.2832406Z D: int, 2025-05-07T20:33:00.2832634Z scale_ub: Optional[float], 2025-05-07T20:33:00.2832949Z contiguous: bool, 2025-05-07T20:33:00.2833196Z compiled: bool, 2025-05-07T20:33:00.2833427Z ) -> None: 2025-05-07T20:33:00.2833634Z torch.manual_seed(2025) 2025-05-07T20:33:00.2833882Z 2025-05-07T20:33:00.2834150Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.2834493Z 2025-05-07T20:33:00.2834689Z > x_sign = torch.sign(x) 2025-05-07T20:33:00.2836603Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
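The OOM sizes track the bf16 input exactly: x has shape [T, 2*D] in bfloat16, so the 56.00 MiB failures correspond to T=2048, D=7168 (2048 * 14336 * 2 bytes) and the 320.00 MiB one to T=16384, D=5120. The PYTORCH_CUDA_ALLOC_CONF hint printed in the message only helps if it is in place before the process first touches CUDA; a sketch of wiring it in at interpreter startup (the placement, not the variable itself, is the assumption here):

    import os

    # Must be set before the first CUDA allocation; once the caching
    # allocator is initialized, changing the variable has no effect.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # imported after the env var so the allocator picks it up

In CI it is usually cleaner to export the variable in the job definition than to rely on import order.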
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.2838571Z 2025-05-07T20:33:00.2838692Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:00.2838901Z 2025-05-07T20:33:00.2839011Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.2839419Z self=, 2025-05-07T20:33:00.2839824Z T=1, 2025-05-07T20:33:00.2840011Z D=7168, 2025-05-07T20:33:00.2840458Z scale_ub=1200.0, 2025-05-07T20:33:00.2840682Z contiguous=True, 2025-05-07T20:33:00.2840907Z compiled=False, 2025-05-07T20:33:00.2841107Z ) 2025-05-07T20:33:00.2841427Z self = 2025-05-07T20:33:00.2841909Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:00.2842181Z 2025-05-07T20:33:00.2842264Z @given( 2025-05-07T20:33:00.2842485Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.2842811Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.2843121Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.2843441Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.2843771Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.2844056Z ) 2025-05-07T20:33:00.2844404Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.2844842Z def test_silu_mul_quant( 2025-05-07T20:33:00.2845089Z self, 2025-05-07T20:33:00.2845280Z T: int, 2025-05-07T20:33:00.2845487Z D: int, 2025-05-07T20:33:00.2845713Z scale_ub: Optional[float], 2025-05-07T20:33:00.2845983Z contiguous: bool, 2025-05-07T20:33:00.2846223Z compiled: bool, 2025-05-07T20:33:00.2846455Z ) -> None: 2025-05-07T20:33:00.2846668Z torch.manual_seed(2025) 2025-05-07T20:33:00.2846910Z 2025-05-07T20:33:00.2847305Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.2847655Z 2025-05-07T20:33:00.2847847Z x_sign = torch.sign(x) 2025-05-07T20:33:00.2848140Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.2848453Z x = x_sign * x_clamp 2025-05-07T20:33:00.2848691Z x0 = x[:, :D] 2025-05-07T20:33:00.2848910Z x1 = x[:, D:] 2025-05-07T20:33:00.2849124Z 2025-05-07T20:33:00.2849304Z if contiguous: 2025-05-07T20:33:00.2849547Z x0 = x0.contiguous() 2025-05-07T20:33:00.2849813Z x1 = x1.contiguous() 2025-05-07T20:33:00.2850054Z 2025-05-07T20:33:00.2850248Z if scale_ub is not None: 2025-05-07T20:33:00.2850530Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.2850863Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.2851193Z ) 2025-05-07T20:33:00.2851393Z else: 2025-05-07T20:33:00.2851614Z scale_ub_tensor = None 2025-05-07T20:33:00.2851863Z 2025-05-07T20:33:00.2852112Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.2852496Z op = silu_mul_quant 2025-05-07T20:33:00.2852755Z if compiled: 2025-05-07T20:33:00.2853005Z op = torch.compile(op) 2025-05-07T20:33:00.2853312Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.2853592Z 2025-05-07T20:33:00.2853791Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.2853955Z 2025-05-07T20:33:00.2854064Z moe/activation_test.py:117: 2025-05-07T20:33:00.2854356Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.2854695Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.2854976Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.2855662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.2856416Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.2856967Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.2857652Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.2858305Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.2858837Z kernel = self.compile( 2025-05-07T20:33:00.2859389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.2860045Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.2860442Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.2860697Z 2025-05-07T20:33:00.2860905Z self = 2025-05-07T20:33:00.2861981Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.2863341Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d7504b80>} 2025-05-07T20:33:00.2864666Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.2865677Z context = 2025-05-07T20:33:00.2865957Z 2025-05-07T20:33:00.2866123Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.2866644Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.2867193Z module_map=module_map) 2025-05-07T20:33:00.2867626Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.2867982Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.2868240Z E ^ 2025-05-07T20:33:00.2868703Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.2869159Z 2025-05-07T20:33:00.2869581Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.2870095Z 2025-05-07T20:33:00.2870197Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.2870611Z self=, 2025-05-07T20:33:00.2871001Z T=128, 2025-05-07T20:33:00.2871187Z D=5120, 2025-05-07T20:33:00.2871381Z scale_ub=None, 2025-05-07T20:33:00.2871590Z contiguous=True, 2025-05-07T20:33:00.2871810Z compiled=False, 2025-05-07T20:33:00.2872013Z ) 2025-05-07T20:33:00.2872336Z self = 2025-05-07T20:33:00.2872942Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:00.2873208Z 2025-05-07T20:33:00.2873288Z @given( 2025-05-07T20:33:00.2873516Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.2873822Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.2874128Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.2874474Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.2874800Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.2875077Z ) 2025-05-07T20:33:00.2875422Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.2875859Z def test_silu_mul_quant( 2025-05-07T20:33:00.2876150Z self, 2025-05-07T20:33:00.2876350Z T: int, 2025-05-07T20:33:00.2876550Z D: int, 2025-05-07T20:33:00.2876766Z scale_ub: Optional[float], 2025-05-07T20:33:00.2877039Z contiguous: bool, 2025-05-07T20:33:00.2877283Z compiled: bool, 2025-05-07T20:33:00.2877503Z ) -> None: 2025-05-07T20:33:00.2877717Z torch.manual_seed(2025) 2025-05-07T20:33:00.2877958Z 2025-05-07T20:33:00.2878231Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.2878583Z 2025-05-07T20:33:00.2878787Z x_sign = torch.sign(x) 2025-05-07T20:33:00.2879084Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.2879390Z x = x_sign * x_clamp 2025-05-07T20:33:00.2879631Z x0 = x[:, :D] 2025-05-07T20:33:00.2879846Z x1 = x[:, D:] 2025-05-07T20:33:00.2880045Z 2025-05-07T20:33:00.2880227Z if contiguous: 2025-05-07T20:33:00.2880463Z x0 = x0.contiguous() 2025-05-07T20:33:00.2880713Z x1 = x1.contiguous() 2025-05-07T20:33:00.2880949Z 2025-05-07T20:33:00.2881138Z if scale_ub is not None: 2025-05-07T20:33:00.2881408Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.2881741Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.2882045Z ) 2025-05-07T20:33:00.2882228Z else: 2025-05-07T20:33:00.2882438Z scale_ub_tensor = None 2025-05-07T20:33:00.2882689Z 2025-05-07T20:33:00.2882912Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.2883222Z op = silu_mul_quant 2025-05-07T20:33:00.2883471Z if compiled: 2025-05-07T20:33:00.2883718Z op = torch.compile(op) 2025-05-07T20:33:00.2884010Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.2884284Z 2025-05-07T20:33:00.2884478Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.2884641Z 2025-05-07T20:33:00.2884740Z moe/activation_test.py:117: 2025-05-07T20:33:00.2885027Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.2885441Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.2885717Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.2886423Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.2887103Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.2887639Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.2888314Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.2888989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.2889532Z kernel = self.compile( 2025-05-07T20:33:00.2890085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.2890735Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.2891131Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.2891408Z 2025-05-07T20:33:00.2891616Z self = 2025-05-07T20:33:00.2892674Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.2894026Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d7505a80>} 2025-05-07T20:33:00.2895346Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.2896404Z context = 2025-05-07T20:33:00.2896692Z 2025-05-07T20:33:00.2896870Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.2897389Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.2897873Z module_map=module_map) 2025-05-07T20:33:00.2898237Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.2898581Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.2898841Z E ^ 2025-05-07T20:33:00.2899307Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.2899753Z 2025-05-07T20:33:00.2900184Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.4064340Z 2025-05-07T20:33:00.4064978Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.4066170Z self=, 2025-05-07T20:33:00.4067301Z T=128, 2025-05-07T20:33:00.4067979Z D=7168, 2025-05-07T20:33:00.4068472Z scale_ub=None, 2025-05-07T20:33:00.4069050Z contiguous=True, 2025-05-07T20:33:00.4069645Z compiled=False, 2025-05-07T20:33:00.4070126Z ) 2025-05-07T20:33:00.4070768Z self = 2025-05-07T20:33:00.4071649Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:00.4071923Z 2025-05-07T20:33:00.4072001Z @given( 2025-05-07T20:33:00.4072234Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.4072548Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.4072850Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.4073190Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.4073513Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.4073796Z ) 2025-05-07T20:33:00.4074466Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.4074902Z def test_silu_mul_quant( 2025-05-07T20:33:00.4075145Z self, 2025-05-07T20:33:00.4075342Z T: int, 2025-05-07T20:33:00.4075541Z D: int, 2025-05-07T20:33:00.4075754Z scale_ub: Optional[float], 2025-05-07T20:33:00.4076023Z contiguous: bool, 2025-05-07T20:33:00.4076263Z compiled: bool, 2025-05-07T20:33:00.4076484Z ) -> None: 2025-05-07T20:33:00.4076704Z torch.manual_seed(2025) 2025-05-07T20:33:00.4076946Z 2025-05-07T20:33:00.4077213Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.4077560Z 2025-05-07T20:33:00.4077754Z x_sign = torch.sign(x) 2025-05-07T20:33:00.4078047Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.4078355Z x = x_sign * x_clamp 2025-05-07T20:33:00.4078596Z x0 = x[:, :D] 2025-05-07T20:33:00.4078811Z x1 = x[:, D:] 2025-05-07T20:33:00.4079109Z 2025-05-07T20:33:00.4079294Z if contiguous: 2025-05-07T20:33:00.4079519Z x0 = x0.contiguous() 2025-05-07T20:33:00.4079775Z x1 = x1.contiguous() 2025-05-07T20:33:00.4080018Z 2025-05-07T20:33:00.4080202Z if scale_ub is not None: 2025-05-07T20:33:00.4080473Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.4080804Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.4081112Z ) 2025-05-07T20:33:00.4081303Z else: 2025-05-07T20:33:00.4081516Z scale_ub_tensor = None 2025-05-07T20:33:00.4081763Z 2025-05-07T20:33:00.4081989Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.4082297Z op = silu_mul_quant 2025-05-07T20:33:00.4082631Z if compiled: 2025-05-07T20:33:00.4082870Z op = torch.compile(op) 2025-05-07T20:33:00.4083166Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.4083444Z 2025-05-07T20:33:00.4083630Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.4083800Z 2025-05-07T20:33:00.4083897Z moe/activation_test.py:117: 2025-05-07T20:33:00.4084189Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.4084526Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.4084803Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.4085491Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.4086173Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.4086707Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.4087397Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.4088079Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.4088605Z kernel = self.compile( 2025-05-07T20:33:00.4089137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.4089787Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.4090178Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.4090401Z 2025-05-07T20:33:00.4090606Z self = 2025-05-07T20:33:00.4091679Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.4093130Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d7506980>} 2025-05-07T20:33:00.4094451Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.4095462Z context = 2025-05-07T20:33:00.4095742Z 2025-05-07T20:33:00.4095905Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.4096426Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.4096905Z module_map=module_map) 2025-05-07T20:33:00.4097278Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.4097621Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.4097886Z E ^ 2025-05-07T20:33:00.4098351Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.4098797Z 2025-05-07T20:33:00.4099273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.4099782Z 2025-05-07T20:33:00.4099886Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.4100298Z self=, 2025-05-07T20:33:00.4100704Z T=2048, 2025-05-07T20:33:00.4100889Z D=7168, 2025-05-07T20:33:00.4101086Z scale_ub=1200.0, 2025-05-07T20:33:00.4101315Z contiguous=True, 2025-05-07T20:33:00.4101553Z compiled=False, 2025-05-07T20:33:00.4101791Z ) 2025-05-07T20:33:00.4102118Z self = 2025-05-07T20:33:00.4102600Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:00.4102920Z 2025-05-07T20:33:00.4103000Z @given( 2025-05-07T20:33:00.4103229Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.4103537Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.4103843Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.4104171Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.4104500Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.4104776Z ) 2025-05-07T20:33:00.4105130Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.4105565Z def test_silu_mul_quant( 2025-05-07T20:33:00.4105800Z self, 2025-05-07T20:33:00.4105997Z T: int, 2025-05-07T20:33:00.4106197Z D: int, 2025-05-07T20:33:00.4106409Z scale_ub: Optional[float], 2025-05-07T20:33:00.4106678Z contiguous: bool, 2025-05-07T20:33:00.4106925Z compiled: bool, 2025-05-07T20:33:00.4107150Z ) -> None: 2025-05-07T20:33:00.4107365Z torch.manual_seed(2025) 2025-05-07T20:33:00.4107677Z 2025-05-07T20:33:00.4107947Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.4109974Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
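The OOM failures cascade: the "allocated by PyTorch" figure sits at roughly 21.5-21.7 GiB across examples, so allocations from earlier examples are evidently still live when Hypothesis draws the next one, and even 40-56 MiB requests start failing. A cleanup helper called at the top of the test body is one way to decouple examples; calling it per example rather than in setUp matters because @given runs all examples inside a single unittest method call, so setUp fires only once. A sketch (the helper name is illustrative):

    import gc
    import torch

    def _release_cuda_memory() -> None:
        # Drop dead Python references from the previous Hypothesis example,
        # then hand cached blocks back so the next torch.randn can succeed.
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

    # first statement of test_silu_mul_quant's body: _release_cuda_memory()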
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.4111941Z 2025-05-07T20:33:00.4112061Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.4112274Z 2025-05-07T20:33:00.4112375Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.4112785Z self=, 2025-05-07T20:33:00.4113176Z T=1, 2025-05-07T20:33:00.4113479Z D=5120, 2025-05-07T20:33:00.4113672Z scale_ub=1200.0, 2025-05-07T20:33:00.4113892Z contiguous=True, 2025-05-07T20:33:00.4114115Z compiled=False, 2025-05-07T20:33:00.4114323Z ) 2025-05-07T20:33:00.4114635Z self = 2025-05-07T20:33:00.4115137Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:00.4115403Z 2025-05-07T20:33:00.4115483Z @given( 2025-05-07T20:33:00.4115765Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.4116286Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.4116689Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.4117098Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.4126166Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.4126479Z ) 2025-05-07T20:33:00.4126836Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.4127299Z def test_silu_mul_quant( 2025-05-07T20:33:00.4127628Z self, 2025-05-07T20:33:00.4127836Z T: int, 2025-05-07T20:33:00.4128035Z D: int, 2025-05-07T20:33:00.4128265Z scale_ub: Optional[float], 2025-05-07T20:33:00.4128545Z contiguous: bool, 2025-05-07T20:33:00.4128788Z compiled: bool, 2025-05-07T20:33:00.4129022Z ) -> None: 2025-05-07T20:33:00.4129246Z torch.manual_seed(2025) 2025-05-07T20:33:00.4129489Z 2025-05-07T20:33:00.4129766Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.4130112Z 2025-05-07T20:33:00.4130308Z x_sign = torch.sign(x) 2025-05-07T20:33:00.4130594Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.4130902Z x = x_sign * x_clamp 2025-05-07T20:33:00.4131191Z x0 = x[:, :D] 2025-05-07T20:33:00.4131434Z x1 = x[:, D:] 2025-05-07T20:33:00.4131675Z 2025-05-07T20:33:00.4131863Z if contiguous: 2025-05-07T20:33:00.4132101Z x0 = x0.contiguous() 2025-05-07T20:33:00.4132366Z x1 = x1.contiguous() 2025-05-07T20:33:00.4132613Z 2025-05-07T20:33:00.4132803Z if scale_ub is not None: 2025-05-07T20:33:00.4133081Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.4133419Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.4133729Z ) 2025-05-07T20:33:00.4133930Z else: 2025-05-07T20:33:00.4134145Z scale_ub_tensor = None 2025-05-07T20:33:00.4134401Z 2025-05-07T20:33:00.4134635Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.4134951Z op = silu_mul_quant 2025-05-07T20:33:00.4135205Z if compiled: 2025-05-07T20:33:00.4135452Z op = torch.compile(op) 2025-05-07T20:33:00.4135761Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.4136043Z 2025-05-07T20:33:00.4136238Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.4136409Z 2025-05-07T20:33:00.4136516Z moe/activation_test.py:117: 2025-05-07T20:33:00.4136817Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.4137145Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.4137431Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.4138120Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.4138805Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.4139341Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.4140023Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.4140944Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.4141473Z kernel = self.compile( 2025-05-07T20:33:00.4142169Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.4142834Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.4143250Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.4143485Z 2025-05-07T20:33:00.4143690Z self = 2025-05-07T20:33:00.4144763Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.4146122Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d7507e20>} 2025-05-07T20:33:00.4147511Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.4148595Z context = 2025-05-07T20:33:00.4148883Z 2025-05-07T20:33:00.4149050Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.4149573Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.4150049Z module_map=module_map) 2025-05-07T20:33:00.4150412Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.4150770Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.4151030Z E ^ 2025-05-07T20:33:00.4151490Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.4152023Z 2025-05-07T20:33:00.4152459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.4960613Z 2025-05-07T20:33:00.4961060Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.4961668Z self=, 2025-05-07T20:33:00.4962203Z T=2048, 2025-05-07T20:33:00.4962442Z D=5120, 2025-05-07T20:33:00.4962681Z scale_ub=None, 2025-05-07T20:33:00.4962936Z contiguous=True, 2025-05-07T20:33:00.4963216Z compiled=False, 2025-05-07T20:33:00.4963473Z ) 2025-05-07T20:33:00.4963808Z self = 2025-05-07T20:33:00.4964296Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:00.4964576Z 2025-05-07T20:33:00.4964654Z @given( 2025-05-07T20:33:00.4964895Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.4965197Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.4965505Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.4965828Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.4966143Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.4966420Z ) 2025-05-07T20:33:00.4966763Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.4967188Z def test_silu_mul_quant( 2025-05-07T20:33:00.4967422Z self, 2025-05-07T20:33:00.4967609Z T: int, 2025-05-07T20:33:00.4967795Z D: int, 2025-05-07T20:33:00.4968002Z scale_ub: Optional[float], 2025-05-07T20:33:00.4968264Z contiguous: bool, 2025-05-07T20:33:00.4968496Z compiled: bool, 2025-05-07T20:33:00.4968707Z ) -> None: 2025-05-07T20:33:00.4968912Z torch.manual_seed(2025) 2025-05-07T20:33:00.4969151Z 2025-05-07T20:33:00.4969413Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.4969749Z 2025-05-07T20:33:00.4969934Z > x_sign = torch.sign(x) 2025-05-07T20:33:00.4972147Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.4974020Z 2025-05-07T20:33:00.4974141Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:00.4974346Z 2025-05-07T20:33:00.4974442Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.4974850Z self=, 2025-05-07T20:33:00.4975254Z T=16384, 2025-05-07T20:33:00.4975439Z D=5120, 2025-05-07T20:33:00.4975622Z scale_ub=None, 2025-05-07T20:33:00.4975830Z contiguous=True, 2025-05-07T20:33:00.4976106Z compiled=False, 2025-05-07T20:33:00.4976301Z ) 2025-05-07T20:33:00.4976614Z self = 2025-05-07T20:33:00.4977101Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:00.4977370Z 2025-05-07T20:33:00.4977447Z @given( 2025-05-07T20:33:00.4977677Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.4977996Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.4978294Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.4978646Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.4978966Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.4979312Z ) 2025-05-07T20:33:00.4979646Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.4980097Z def test_silu_mul_quant( 2025-05-07T20:33:00.4980335Z self, 2025-05-07T20:33:00.4980528Z T: int, 2025-05-07T20:33:00.4980713Z D: int, 2025-05-07T20:33:00.4980932Z scale_ub: Optional[float], 2025-05-07T20:33:00.4981210Z contiguous: bool, 2025-05-07T20:33:00.4981436Z compiled: bool, 2025-05-07T20:33:00.4981691Z ) -> None: 2025-05-07T20:33:00.4981921Z torch.manual_seed(2025) 2025-05-07T20:33:00.4982157Z 2025-05-07T20:33:00.4982417Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.4984437Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.4986397Z 2025-05-07T20:33:00.4986524Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.4986730Z 2025-05-07T20:33:00.4986834Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.4987237Z self=, 2025-05-07T20:33:00.4987695Z T=4096, 2025-05-07T20:33:00.4987877Z D=5120, 2025-05-07T20:33:00.4988049Z scale_ub=None, 2025-05-07T20:33:00.4988255Z contiguous=True, 2025-05-07T20:33:00.4988470Z compiled=False, 2025-05-07T20:33:00.4988658Z ) 2025-05-07T20:33:00.4988970Z self = 2025-05-07T20:33:00.4989453Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:00.4989713Z 2025-05-07T20:33:00.4989790Z @given( 2025-05-07T20:33:00.4990093Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.4990406Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.4990699Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.4991011Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.4991330Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.4991646Z ) 2025-05-07T20:33:00.4991989Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.4992418Z def test_silu_mul_quant( 2025-05-07T20:33:00.4992651Z self, 2025-05-07T20:33:00.4992832Z T: int, 2025-05-07T20:33:00.4993022Z D: int, 2025-05-07T20:33:00.4993233Z scale_ub: Optional[float], 2025-05-07T20:33:00.4993498Z contiguous: bool, 2025-05-07T20:33:00.4993742Z compiled: bool, 2025-05-07T20:33:00.4993958Z ) -> None: 2025-05-07T20:33:00.4994162Z torch.manual_seed(2025) 2025-05-07T20:33:00.4994404Z 2025-05-07T20:33:00.4994678Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.4996756Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.4998658Z 2025-05-07T20:33:00.4998782Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.4998985Z 2025-05-07T20:33:00.4999127Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.4999544Z self=, 2025-05-07T20:33:00.4999957Z T=2048, 2025-05-07T20:33:00.5000139Z D=5120, 2025-05-07T20:33:00.5000324Z scale_ub=None, 2025-05-07T20:33:00.5000542Z contiguous=False, 2025-05-07T20:33:00.5000771Z compiled=False, 2025-05-07T20:33:00.5000970Z ) 2025-05-07T20:33:00.5001279Z self = 2025-05-07T20:33:00.5001770Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:00.5002044Z 2025-05-07T20:33:00.5002121Z @given( 2025-05-07T20:33:00.5002347Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.5002653Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.5002954Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.5003272Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.5003594Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.5003872Z ) 2025-05-07T20:33:00.5004215Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.5004644Z def test_silu_mul_quant( 2025-05-07T20:33:00.5004889Z self, 2025-05-07T20:33:00.5005070Z T: int, 2025-05-07T20:33:00.5005267Z D: int, 2025-05-07T20:33:00.5005476Z scale_ub: Optional[float], 2025-05-07T20:33:00.5005728Z contiguous: bool, 2025-05-07T20:33:00.5005968Z compiled: bool, 2025-05-07T20:33:00.5006182Z ) -> None: 2025-05-07T20:33:00.5006382Z torch.manual_seed(2025) 2025-05-07T20:33:00.5006615Z 2025-05-07T20:33:00.5006876Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.5008962Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
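For orientation, the op under test fuses silu(x0) * x1 with fp8 quantization and returns (y_fp8, y_scale). The following is a plain-PyTorch reconstruction inferred from the test's call signature, not FBGEMM's Triton kernel; the row-wise scaling, the eps clamp, and the e4m3fn dtype are all assumptions:

    from typing import Optional, Tuple
    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # silu(x0) * x1 in fp32 for accuracy, then quantize row-wise to fp8.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            # scale_ub caps the per-row maximum before the scale is derived.
            row_max = torch.minimum(row_max, scale_ub.float())
        y_scale = row_max / FP8_MAX
        y_fp8 = (y / y_scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
        return y_fp8, y_scale.squeeze(-1)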
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.5010774Z 2025-05-07T20:33:00.5010893Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.5011097Z 2025-05-07T20:33:00.5011192Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.5011600Z self=, 2025-05-07T20:33:00.5012047Z T=4096, 2025-05-07T20:33:00.5012225Z D=7168, 2025-05-07T20:33:00.5012405Z scale_ub=None, 2025-05-07T20:33:00.5012612Z contiguous=True, 2025-05-07T20:33:00.5012820Z compiled=True, 2025-05-07T20:33:00.5013013Z ) 2025-05-07T20:33:00.5013320Z self = 2025-05-07T20:33:00.5013807Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:00.5014064Z 2025-05-07T20:33:00.5014143Z @given( 2025-05-07T20:33:00.5014409Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.5014709Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.5014994Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.5015310Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.5015632Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.5015902Z ) 2025-05-07T20:33:00.5016238Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.5016670Z def test_silu_mul_quant( 2025-05-07T20:33:00.5016907Z self, 2025-05-07T20:33:00.5017089Z T: int, 2025-05-07T20:33:00.5017290Z D: int, 2025-05-07T20:33:00.5017504Z scale_ub: Optional[float], 2025-05-07T20:33:00.5017811Z contiguous: bool, 2025-05-07T20:33:00.5018052Z compiled: bool, 2025-05-07T20:33:00.5018273Z ) -> None: 2025-05-07T20:33:00.5018482Z torch.manual_seed(2025) 2025-05-07T20:33:00.5018716Z 2025-05-07T20:33:00.5018985Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.5021009Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.5022931Z 2025-05-07T20:33:00.5023043Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.5023256Z 2025-05-07T20:33:00.5023354Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.5023767Z self=, 2025-05-07T20:33:00.5024166Z T=2048, 2025-05-07T20:33:00.5024343Z D=5120, 2025-05-07T20:33:00.5024532Z scale_ub=1200.0, 2025-05-07T20:33:00.5024754Z contiguous=False, 2025-05-07T20:33:00.5024969Z compiled=False, 2025-05-07T20:33:00.5576229Z ) 2025-05-07T20:33:00.5577105Z self = 2025-05-07T20:33:00.5578158Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:00.5578691Z 2025-05-07T20:33:00.5578848Z @given( 2025-05-07T20:33:00.5579277Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.5579886Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.5580482Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.5581117Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.5581564Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.5581841Z ) 2025-05-07T20:33:00.5582348Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.5582782Z def test_silu_mul_quant( 2025-05-07T20:33:00.5583025Z self, 2025-05-07T20:33:00.5583206Z T: int, 2025-05-07T20:33:00.5583395Z D: int, 2025-05-07T20:33:00.5583611Z scale_ub: Optional[float], 2025-05-07T20:33:00.5583876Z contiguous: bool, 2025-05-07T20:33:00.5584112Z compiled: bool, 2025-05-07T20:33:00.5584335Z ) -> None: 2025-05-07T20:33:00.5584541Z torch.manual_seed(2025) 2025-05-07T20:33:00.5584774Z 2025-05-07T20:33:00.5585036Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.5587074Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.5589076Z 2025-05-07T20:33:00.5589197Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.5589400Z 2025-05-07T20:33:00.5589498Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.5589905Z self=, 2025-05-07T20:33:00.5590308Z T=4096, 2025-05-07T20:33:00.5590494Z D=7168, 2025-05-07T20:33:00.5590671Z scale_ub=1200.0, 2025-05-07T20:33:00.5590887Z contiguous=True, 2025-05-07T20:33:00.5591103Z compiled=False, 2025-05-07T20:33:00.5591370Z ) 2025-05-07T20:33:00.5591693Z self = 2025-05-07T20:33:00.5592181Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:00.5592452Z 2025-05-07T20:33:00.5592527Z @given( 2025-05-07T20:33:00.5592753Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.5593059Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.5593352Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.5593676Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.5594004Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.5594295Z ) 2025-05-07T20:33:00.5594639Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.5595084Z def test_silu_mul_quant( 2025-05-07T20:33:00.5595324Z self, 2025-05-07T20:33:00.5595508Z T: int, 2025-05-07T20:33:00.5595706Z D: int, 2025-05-07T20:33:00.5595922Z scale_ub: Optional[float], 2025-05-07T20:33:00.5596184Z contiguous: bool, 2025-05-07T20:33:00.5596421Z compiled: bool, 2025-05-07T20:33:00.5596645Z ) -> None: 2025-05-07T20:33:00.5596855Z torch.manual_seed(2025) 2025-05-07T20:33:00.5597098Z 2025-05-07T20:33:00.5597364Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.5599395Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.5601259Z 2025-05-07T20:33:00.5601379Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.5601586Z 2025-05-07T20:33:00.5601765Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.5602175Z self=, 2025-05-07T20:33:00.5602577Z T=16384, 2025-05-07T20:33:00.5602758Z D=7168, 2025-05-07T20:33:00.5602942Z scale_ub=None, 2025-05-07T20:33:00.5603153Z contiguous=False, 2025-05-07T20:33:00.5603369Z compiled=True, 2025-05-07T20:33:00.5603572Z ) 2025-05-07T20:33:00.5603887Z self = 2025-05-07T20:33:00.5604370Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:00.5604649Z 2025-05-07T20:33:00.5604721Z @given( 2025-05-07T20:33:00.5604939Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.5605252Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.5605542Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.5605862Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.5606185Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.5606505Z ) 2025-05-07T20:33:00.5606862Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.5607301Z def test_silu_mul_quant( 2025-05-07T20:33:00.5607538Z self, 2025-05-07T20:33:00.5607733Z T: int, 2025-05-07T20:33:00.5607956Z D: int, 2025-05-07T20:33:00.5608173Z scale_ub: Optional[float], 2025-05-07T20:33:00.5608438Z contiguous: bool, 2025-05-07T20:33:00.5608668Z compiled: bool, 2025-05-07T20:33:00.5608881Z ) -> None: 2025-05-07T20:33:00.5609089Z torch.manual_seed(2025) 2025-05-07T20:33:00.5609330Z 2025-05-07T20:33:00.5609597Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.5611619Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
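Since the sampled sizes go up to T=16384 with D=7168 (a 448 MiB bf16 input before any intermediates), another way to keep the property test alive on a ~22 GiB card is to reject draws that cannot fit. The budget below is illustrative, not a measured limit:

    from hypothesis import assume

    MAX_INPUT_BYTES = 512 * 1024 * 1024  # illustrative per-example budget

    def fits_on_runner(T: int, D: int) -> bool:
        # x is [T, 2*D] bfloat16, i.e. 2 bytes per element.
        return T * 2 * D * 2 <= MAX_INPUT_BYTES

    # inside the test body, before torch.randn:
    #     assume(fits_on_runner(T, D))

hypothesis.assume makes Hypothesis discard an example that fails the predicate instead of recording it as a failure, so coverage of the remaining parameter grid is preserved.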
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.5613565Z 2025-05-07T20:33:00.5613691Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.5613894Z 2025-05-07T20:33:00.5613994Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.5614404Z self=, 2025-05-07T20:33:00.5614803Z T=4096, 2025-05-07T20:33:00.5614983Z D=7168, 2025-05-07T20:33:00.5615171Z scale_ub=None, 2025-05-07T20:33:00.5615386Z contiguous=True, 2025-05-07T20:33:00.5615601Z compiled=False, 2025-05-07T20:33:00.5615804Z ) 2025-05-07T20:33:00.5616124Z self = 2025-05-07T20:33:00.5616611Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:00.5616874Z 2025-05-07T20:33:00.5616948Z @given( 2025-05-07T20:33:00.5617173Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.5617472Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.5617763Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.5618085Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.5618406Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.5618681Z ) 2025-05-07T20:33:00.5619018Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.5619445Z def test_silu_mul_quant( 2025-05-07T20:33:00.5619684Z self, 2025-05-07T20:33:00.5619875Z T: int, 2025-05-07T20:33:00.5620069Z D: int, 2025-05-07T20:33:00.5620281Z scale_ub: Optional[float], 2025-05-07T20:33:00.5620621Z contiguous: bool, 2025-05-07T20:33:00.5620860Z compiled: bool, 2025-05-07T20:33:00.5621079Z ) -> None: 2025-05-07T20:33:00.5621281Z torch.manual_seed(2025) 2025-05-07T20:33:00.5621522Z 2025-05-07T20:33:00.5621808Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.5623838Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.5625686Z 2025-05-07T20:33:00.5625806Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.5626016Z 2025-05-07T20:33:00.5626156Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.5626556Z self=, 2025-05-07T20:33:00.5626955Z T=16384, 2025-05-07T20:33:00.5627132Z D=7168, 2025-05-07T20:33:00.5627312Z scale_ub=None, 2025-05-07T20:33:00.5627564Z contiguous=True, 2025-05-07T20:33:00.5627780Z compiled=False, 2025-05-07T20:33:00.5627982Z ) 2025-05-07T20:33:00.5628289Z self = 2025-05-07T20:33:00.5628766Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:00.5629046Z 2025-05-07T20:33:00.5629122Z @given( 2025-05-07T20:33:00.5629339Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.5629694Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.5630281Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.5630702Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.5631061Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.5638734Z ) 2025-05-07T20:33:00.5639107Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.5639565Z def test_silu_mul_quant( 2025-05-07T20:33:00.5639803Z self, 2025-05-07T20:33:00.5639996Z T: int, 2025-05-07T20:33:00.5640375Z D: int, 2025-05-07T20:33:00.5640581Z scale_ub: Optional[float], 2025-05-07T20:33:00.5640847Z contiguous: bool, 2025-05-07T20:33:00.5641089Z compiled: bool, 2025-05-07T20:33:00.5641302Z ) -> None: 2025-05-07T20:33:00.5641508Z torch.manual_seed(2025) 2025-05-07T20:33:00.5641741Z 2025-05-07T20:33:00.5642009Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.5644049Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.5645917Z 2025-05-07T20:33:00.5646032Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.5646249Z 2025-05-07T20:33:00.5646348Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.5646761Z self=, 2025-05-07T20:33:00.5647156Z T=16384, 2025-05-07T20:33:00.5647345Z D=7168, 2025-05-07T20:33:00.5647537Z scale_ub=1200.0, 2025-05-07T20:33:00.5647752Z contiguous=True, 2025-05-07T20:33:00.5647968Z compiled=False, 2025-05-07T20:33:00.5648164Z ) 2025-05-07T20:33:00.5648624Z self = 2025-05-07T20:33:00.5649130Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:00.5649407Z 2025-05-07T20:33:00.5649491Z @given( 2025-05-07T20:33:00.5649722Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.5650027Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.5650332Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.5650658Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.5650972Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.5651252Z ) 2025-05-07T20:33:00.5651593Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.5652027Z def test_silu_mul_quant( 2025-05-07T20:33:00.5652258Z self, 2025-05-07T20:33:00.5652442Z T: int, 2025-05-07T20:33:00.5652626Z D: int, 2025-05-07T20:33:00.5652856Z scale_ub: Optional[float], 2025-05-07T20:33:00.5653181Z contiguous: bool, 2025-05-07T20:33:00.5653403Z compiled: bool, 2025-05-07T20:33:00.5653614Z ) -> None: 2025-05-07T20:33:00.5653818Z torch.manual_seed(2025) 2025-05-07T20:33:00.5654059Z 2025-05-07T20:33:00.5654322Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.5656369Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
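Every one of these OOM reports ends with the same allocator hint. PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator initializes, so it must be set before the process makes its first CUDA allocation; the sketch below (illustrative only, not from this repo) sets it in-process, though exporting the variable in the CI job environment is the more reliable route:

    import os

    # Must take effect before the first CUDA allocation in this process;
    # in CI, prefer exporting this in the job environment instead.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch  # importing and allocating only afterwards keeps this safe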
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.5658298Z 2025-05-07T20:33:00.5658414Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.7467265Z 2025-05-07T20:33:00.7467697Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.7468343Z self=, 2025-05-07T20:33:00.7468897Z T=128, 2025-05-07T20:33:00.7469161Z D=5120, 2025-05-07T20:33:00.7469426Z scale_ub=1200.0, 2025-05-07T20:33:00.7469727Z contiguous=False, 2025-05-07T20:33:00.7469998Z compiled=False, 2025-05-07T20:33:00.7470218Z ) 2025-05-07T20:33:00.7470533Z self = 2025-05-07T20:33:00.7471048Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:00.7471335Z 2025-05-07T20:33:00.7471416Z @given( 2025-05-07T20:33:00.7471717Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.7472037Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.7472360Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.7472702Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.7473029Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.7473324Z ) 2025-05-07T20:33:00.7473686Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.7474139Z def test_silu_mul_quant( 2025-05-07T20:33:00.7474401Z self, 2025-05-07T20:33:00.7474607Z T: int, 2025-05-07T20:33:00.7474845Z D: int, 2025-05-07T20:33:00.7475073Z scale_ub: Optional[float], 2025-05-07T20:33:00.7475340Z contiguous: bool, 2025-05-07T20:33:00.7475585Z compiled: bool, 2025-05-07T20:33:00.7475820Z ) -> None: 2025-05-07T20:33:00.7476028Z torch.manual_seed(2025) 2025-05-07T20:33:00.7476280Z 2025-05-07T20:33:00.7476559Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.7476900Z 2025-05-07T20:33:00.7477480Z x_sign = torch.sign(x) 2025-05-07T20:33:00.7477785Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.7478107Z x = x_sign * x_clamp 2025-05-07T20:33:00.7478343Z x0 = x[:, :D] 2025-05-07T20:33:00.7478565Z x1 = x[:, D:] 2025-05-07T20:33:00.7478777Z 2025-05-07T20:33:00.7478958Z if contiguous: 2025-05-07T20:33:00.7479196Z x0 = x0.contiguous() 2025-05-07T20:33:00.7479460Z x1 = x1.contiguous() 2025-05-07T20:33:00.7479695Z 2025-05-07T20:33:00.7479888Z if scale_ub is not None: 2025-05-07T20:33:00.7480164Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.7480497Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.7480804Z ) 2025-05-07T20:33:00.7481003Z else: 2025-05-07T20:33:00.7481211Z scale_ub_tensor = None 2025-05-07T20:33:00.7481472Z 2025-05-07T20:33:00.7481709Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.7482021Z op = silu_mul_quant 2025-05-07T20:33:00.7482359Z if compiled: 2025-05-07T20:33:00.7482608Z op = torch.compile(op) 2025-05-07T20:33:00.7482904Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.7483168Z 2025-05-07T20:33:00.7483360Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.7483522Z 2025-05-07T20:33:00.7483632Z moe/activation_test.py:117: 2025-05-07T20:33:00.7483923Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.7484253Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.7484539Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.7485216Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.7485993Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.7486542Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.7487236Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.7487924Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.7488454Z kernel = self.compile( 2025-05-07T20:33:00.7489022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.7489665Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.7490063Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.7490298Z 2025-05-07T20:33:00.7490505Z self = 2025-05-07T20:33:00.7491583Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.7493008Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d72fcae0>} 2025-05-07T20:33:00.7494321Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.7495333Z context = 2025-05-07T20:33:00.7495624Z 2025-05-07T20:33:00.7495790Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.7496312Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.7496779Z module_map=module_map) 2025-05-07T20:33:00.7497153Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.7497591Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.7497851Z E ^ 2025-05-07T20:33:00.7498315Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.7498785Z 2025-05-07T20:33:00.7499215Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:00.7499721Z 2025-05-07T20:33:00.7499833Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.7500236Z self=, 2025-05-07T20:33:00.7500644Z T=2048, 2025-05-07T20:33:00.7500836Z D=7168, 2025-05-07T20:33:00.7501021Z scale_ub=None, 2025-05-07T20:33:00.7501240Z contiguous=False, 2025-05-07T20:33:00.7501469Z compiled=False, 2025-05-07T20:33:00.7501670Z ) 2025-05-07T20:33:00.7501992Z self = 2025-05-07T20:33:00.7502487Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:00.7502802Z 2025-05-07T20:33:00.7502890Z @given( 2025-05-07T20:33:00.7503114Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.7503427Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.7503736Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.7504059Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.7504390Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.7504676Z ) 2025-05-07T20:33:00.7505018Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.7505467Z def test_silu_mul_quant( 2025-05-07T20:33:00.7505709Z self, 2025-05-07T20:33:00.7505909Z T: int, 2025-05-07T20:33:00.7506146Z D: int, 2025-05-07T20:33:00.7506367Z scale_ub: Optional[float], 2025-05-07T20:33:00.7506636Z contiguous: bool, 2025-05-07T20:33:00.7506876Z compiled: bool, 2025-05-07T20:33:00.7507103Z ) -> None: 2025-05-07T20:33:00.7507319Z torch.manual_seed(2025) 2025-05-07T20:33:00.7507635Z 2025-05-07T20:33:00.7507913Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.7510061Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
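Interleaved with the OOMs is a second, distinct failure mode: the Triton CompilationError above. fp8e4nv is Triton's name for the FP8 E4M3 format, whose lowering requires compute capability sm_89 (Ada) or newer; on older parts such as the A10G (sm_86) Triton exposes only 'fp8e4b15' and 'fp8e5', which is exactly what the ValueError lists. A minimal capability gate, sketched under that assumption (supports_fp8e4nv is a hypothetical helper, not part of the test suite):

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton lowers fp8e4nv (FP8 E4M3) only on sm_89 and newer; an
        # sm_86 GPU reports (8, 6) and so takes the False branch.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)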
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:00.7511963Z 2025-05-07T20:33:00.7512090Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:00.7512299Z 2025-05-07T20:33:00.7512415Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:00.7512826Z self=, 2025-05-07T20:33:00.7513251Z T=128, 2025-05-07T20:33:00.7513444Z D=7168, 2025-05-07T20:33:00.7513634Z scale_ub=1200.0, 2025-05-07T20:33:00.7513864Z contiguous=True, 2025-05-07T20:33:00.7514090Z compiled=True, 2025-05-07T20:33:00.7514288Z ) 2025-05-07T20:33:00.7514610Z self = 2025-05-07T20:33:00.7515095Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:00.7515357Z 2025-05-07T20:33:00.7515435Z @given( 2025-05-07T20:33:00.7515664Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:00.7515979Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:00.7516292Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:00.7516614Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:00.7517021Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:00.7517311Z ) 2025-05-07T20:33:00.7517653Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:00.7518099Z def test_silu_mul_quant( 2025-05-07T20:33:00.7518345Z self, 2025-05-07T20:33:00.7518554Z T: int, 2025-05-07T20:33:00.7518758Z D: int, 2025-05-07T20:33:00.7518970Z scale_ub: Optional[float], 2025-05-07T20:33:00.7519244Z contiguous: bool, 2025-05-07T20:33:00.7519493Z compiled: bool, 2025-05-07T20:33:00.7519709Z ) -> None: 2025-05-07T20:33:00.7519928Z torch.manual_seed(2025) 2025-05-07T20:33:00.7520172Z 2025-05-07T20:33:00.7520442Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:00.7520785Z 2025-05-07T20:33:00.7520977Z x_sign = torch.sign(x) 2025-05-07T20:33:00.7521272Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:00.7521584Z x = x_sign * x_clamp 2025-05-07T20:33:00.7521883Z x0 = x[:, :D] 2025-05-07T20:33:00.7522102Z x1 = x[:, D:] 2025-05-07T20:33:00.7522302Z 2025-05-07T20:33:00.7522488Z if contiguous: 2025-05-07T20:33:00.7522722Z x0 = x0.contiguous() 2025-05-07T20:33:00.7522981Z x1 = x1.contiguous() 2025-05-07T20:33:00.7523219Z 2025-05-07T20:33:00.7523412Z if scale_ub is not None: 2025-05-07T20:33:00.7523678Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:00.7524013Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:00.7524329Z ) 2025-05-07T20:33:00.7524515Z else: 2025-05-07T20:33:00.7524728Z scale_ub_tensor = None 2025-05-07T20:33:00.7524979Z 2025-05-07T20:33:00.7525209Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:00.7525562Z op = silu_mul_quant 2025-05-07T20:33:00.7525820Z if compiled: 2025-05-07T20:33:00.7526079Z op = torch.compile(op) 2025-05-07T20:33:00.7526373Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.7526648Z 2025-05-07T20:33:00.7526845Z > y_fp8, y_scale = fn() 2025-05-07T20:33:00.7527007Z 2025-05-07T20:33:00.7527103Z moe/activation_test.py:117: 2025-05-07T20:33:00.7527398Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.7527731Z moe/activation_test.py:115: in fn 2025-05-07T20:33:00.7528003Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:00.7528568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:00.7529138Z return fn(*args, **kwargs) 
2025-05-07T20:33:00.7529801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:00.7530477Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:00.7531026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:00.7531731Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:00.7532395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:00.7532919Z kernel = self.compile( 2025-05-07T20:33:00.7533469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:00.7534146Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:00.7534539Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:00.7534773Z 2025-05-07T20:33:00.7534981Z self = 2025-05-07T20:33:00.7536141Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:00.7537501Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d7180040>} 2025-05-07T20:33:00.7538829Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:00.7539835Z context = 2025-05-07T20:33:00.7540395Z 2025-05-07T20:33:00.7540565Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:00.7541086Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:00.7541565Z module_map=module_map) 2025-05-07T20:33:00.7541936Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:00.7542424Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:00.7542685Z E ^ 2025-05-07T20:33:00.7543147Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:00.7543604Z 2025-05-07T20:33:00.7544022Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.0865526Z 2025-05-07T20:33:01.0865859Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.0866326Z self=, 2025-05-07T20:33:01.0866911Z T=128, 2025-05-07T20:33:01.0867199Z D=7168, 2025-05-07T20:33:01.0867543Z scale_ub=1200.0, 2025-05-07T20:33:01.0868128Z contiguous=True, 2025-05-07T20:33:01.0868358Z compiled=False, 2025-05-07T20:33:01.0868567Z ) 2025-05-07T20:33:01.0868906Z self = 2025-05-07T20:33:01.0869410Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:01.0869694Z 2025-05-07T20:33:01.0869785Z @given( 2025-05-07T20:33:01.0870014Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.0870332Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.0870650Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.0870984Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.0871321Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.0871614Z ) 2025-05-07T20:33:01.0871963Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.0872431Z def test_silu_mul_quant( 2025-05-07T20:33:01.0872686Z self, 2025-05-07T20:33:01.0872890Z T: int, 2025-05-07T20:33:01.0873091Z D: int, 2025-05-07T20:33:01.0873318Z scale_ub: Optional[float], 2025-05-07T20:33:01.0873604Z contiguous: bool, 2025-05-07T20:33:01.0873847Z compiled: bool, 2025-05-07T20:33:01.0874088Z ) -> None: 2025-05-07T20:33:01.0874313Z torch.manual_seed(2025) 2025-05-07T20:33:01.0874555Z 2025-05-07T20:33:01.0874836Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.0875185Z 2025-05-07T20:33:01.0875376Z x_sign = torch.sign(x) 2025-05-07T20:33:01.0875680Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.0877939Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.0879806Z 2025-05-07T20:33:01.0879926Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:01.0880138Z 2025-05-07T20:33:01.0880250Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.0880663Z self=, 2025-05-07T20:33:01.0881077Z T=128, 2025-05-07T20:33:01.0881271Z D=5120, 2025-05-07T20:33:01.0881461Z scale_ub=1200.0, 2025-05-07T20:33:01.0881691Z contiguous=True, 2025-05-07T20:33:01.0881918Z compiled=True, 2025-05-07T20:33:01.0882121Z ) 2025-05-07T20:33:01.0882445Z self = 2025-05-07T20:33:01.0882935Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:01.0883215Z 2025-05-07T20:33:01.0883301Z @given( 2025-05-07T20:33:01.0883535Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.0883853Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.0884241Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.0884567Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.0884896Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.0885185Z ) 2025-05-07T20:33:01.0885530Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.0885978Z def test_silu_mul_quant( 2025-05-07T20:33:01.0886230Z self, 2025-05-07T20:33:01.0886424Z T: int, 2025-05-07T20:33:01.0886627Z D: int, 2025-05-07T20:33:01.0886850Z scale_ub: Optional[float], 2025-05-07T20:33:01.0887114Z contiguous: bool, 2025-05-07T20:33:01.0887363Z compiled: bool, 2025-05-07T20:33:01.0887641Z ) -> None: 2025-05-07T20:33:01.0887861Z torch.manual_seed(2025) 2025-05-07T20:33:01.0888100Z 2025-05-07T20:33:01.0888381Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.0888738Z 2025-05-07T20:33:01.0888927Z x_sign = torch.sign(x) 2025-05-07T20:33:01.0889225Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.0891220Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.0893168Z 2025-05-07T20:33:01.0893295Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:01.0893503Z 2025-05-07T20:33:01.0893608Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.0894027Z self=, 2025-05-07T20:33:01.0894432Z T=128, 2025-05-07T20:33:01.0894625Z D=7168, 2025-05-07T20:33:01.0894812Z scale_ub=None, 2025-05-07T20:33:01.0895028Z contiguous=True, 2025-05-07T20:33:01.0895256Z compiled=True, 2025-05-07T20:33:01.0895454Z ) 2025-05-07T20:33:01.0895777Z self = 2025-05-07T20:33:01.0896262Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:01.0896529Z 2025-05-07T20:33:01.0896608Z @given( 2025-05-07T20:33:01.0896838Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.0897155Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.0897453Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.0897792Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.0898204Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.0898495Z ) 2025-05-07T20:33:01.0898844Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.0899293Z def test_silu_mul_quant( 2025-05-07T20:33:01.0899539Z self, 2025-05-07T20:33:01.0899728Z T: int, 2025-05-07T20:33:01.0899928Z D: int, 2025-05-07T20:33:01.0900150Z scale_ub: Optional[float], 2025-05-07T20:33:01.0900414Z contiguous: bool, 2025-05-07T20:33:01.0900654Z compiled: bool, 2025-05-07T20:33:01.0900878Z ) -> None: 2025-05-07T20:33:01.0901087Z torch.manual_seed(2025) 2025-05-07T20:33:01.0901330Z 2025-05-07T20:33:01.0901602Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.0903615Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
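By this point the failure site has crept from activation_test.py:92 (the initial randn) up to :95 (the clamp), and the reported free memory has fallen from 26.44 MiB to 4.44 MiB: allocations from earlier Hypothesis examples are still cached when the next example begins. A per-example teardown along these lines would keep the examples independent (release_cuda_memory is a hypothetical helper, shown only as a sketch):

    import gc
    import torch

    def release_cuda_memory() -> None:
        # Hypothetical cleanup between Hypothesis examples: drop dead Python
        # references, then hand the allocator's cached blocks back so the
        # next example starts from a clean slate.
        gc.collect()
        torch.cuda.empty_cache()
        torch.cuda.synchronize()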
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.0905604Z 2025-05-07T20:33:01.0905728Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:01.0905937Z 2025-05-07T20:33:01.0909060Z FAILED 2025-05-07T20:33:01.0909192Z 2025-05-07T20:33:01.0909336Z =================================== FAILURES =================================== 2025-05-07T20:33:01.0909945Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:33:01.0910561Z + Exception Group Traceback (most recent call last): 2025-05-07T20:33:01.0911473Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:33:01.0912232Z | yield 2025-05-07T20:33:01.0912816Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 651, in run 2025-05-07T20:33:01.0913525Z | self._callTestMethod(testMethod) 2025-05-07T20:33:01.0914074Z | ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:33:01.0914819Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 606, in _callTestMethod 2025-05-07T20:33:01.0915561Z | if method() is not None: 2025-05-07T20:33:01.0915902Z | ~~~~~~^^ 2025-05-07T20:33:01.0929656Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:33:01.0930787Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.0931232Z | ^^^^^^^ 2025-05-07T20:33:01.0932115Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:33:01.0933036Z | raise the_error_hypothesis_found 2025-05-07T20:33:01.0933643Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:33:01.0934240Z +-+---------------- 1 ---------------- 2025-05-07T20:33:01.0934662Z | Traceback (most recent call last): 2025-05-07T20:33:01.0935682Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:01.0936813Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.0939989Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.0943194Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:01.0943814Z | self=, 2025-05-07T20:33:01.0944394Z | T=2048, 2025-05-07T20:33:01.0944714Z | D=5120, # or any other generated value 2025-05-07T20:33:01.0945194Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:01.0945708Z | contiguous=True, # or any other generated value 2025-05-07T20:33:01.0946222Z | compiled=False, # or any other generated value 2025-05-07T20:33:01.0946655Z | ) 2025-05-07T20:33:01.0946908Z | 2025-05-07T20:33:01.0947772Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:33:01.0948678Z +---------------- 2 ---------------- 2025-05-07T20:33:01.0949199Z | Traceback (most recent call last): 2025-05-07T20:33:01.0950223Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:01.0951357Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.0954354Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.0957100Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:01.0957741Z | self=, 2025-05-07T20:33:01.0958328Z | T=128, 2025-05-07T20:33:01.0958608Z | D=7168, 2025-05-07T20:33:01.0958898Z | scale_ub=None, 2025-05-07T20:33:01.0959232Z | contiguous=True, 2025-05-07T20:33:01.0959558Z | compiled=True, 2025-05-07T20:33:01.0959859Z | ) 2025-05-07T20:33:01.0960105Z | 2025-05-07T20:33:01.0960768Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:01.0961365Z +---------------- 3 ---------------- 2025-05-07T20:33:01.0961652Z | Traceback (most recent call last): 2025-05-07T20:33:01.0962359Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:01.0963121Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.0965125Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
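Each falsifying example above comes with its own @reproduce_failure recipe. Hypothesis intends these to be stacked, temporarily, on top of the test's existing decorators; the payload encodes all five generated arguments, so the @given strategies must stay exactly as they appear in the log. A usage sketch for failure 1 (test_repro is an illustrative name; in practice the decorator goes on test_silu_mul_quant itself):

    from hypothesis import given, reproduce_failure, strategies as st

    # Replays exactly the T=2048, D=5120 case from failure 1; remove the
    # decorator again once the bug is understood.
    @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=')
    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    def test_repro(T, D, scale_ub, contiguous, compiled) -> None:
        ...  # body identical to test_silu_mul_quant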
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.0967145Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:01.0967583Z | self=, 2025-05-07T20:33:01.0967983Z | T=128, 2025-05-07T20:33:01.0968177Z | D=5120, 2025-05-07T20:33:01.0968546Z | scale_ub=1200.0, 2025-05-07T20:33:01.0968791Z | contiguous=True, 2025-05-07T20:33:01.0969028Z | compiled=True, 2025-05-07T20:33:01.0969249Z | ) 2025-05-07T20:33:01.0969418Z | 2025-05-07T20:33:01.0969927Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:33:01.0970517Z +---------------- 4 ---------------- 2025-05-07T20:33:01.0970799Z | Traceback (most recent call last): 2025-05-07T20:33:01.0971501Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:33:01.0972275Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:01.0972558Z | ~~~~~~^^ 2025-05-07T20:33:01.0973338Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:33:01.0974334Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.0975594Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:33:01.0976740Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:01.0977129Z | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^ 2025-05-07T20:33:01.0977503Z | a, 2025-05-07T20:33:01.0977789Z | ^^ 2025-05-07T20:33:01.0978074Z | ...<23 lines>... 
2025-05-07T20:33:01.0978419Z | USE_INT64=use_int64, 2025-05-07T20:33:01.0978705Z | ^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:01.0979018Z | ) 2025-05-07T20:33:01.0979266Z | ^ 2025-05-07T20:33:01.0980004Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:33:01.0981155Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.0981819Z | ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:01.1004141Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:33:01.1004958Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:01.1005422Z | ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:01.1006070Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:33:01.1006781Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:01.1007172Z | ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:01.1007806Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:33:01.1008394Z | fn() 2025-05-07T20:33:01.1008589Z | ~~^^ 2025-05-07T20:33:01.1009169Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:33:01.1009809Z | self.fn.run( 2025-05-07T20:33:01.1010027Z | ~~~~~~~~~~~^ 2025-05-07T20:33:01.1010235Z | *args, 2025-05-07T20:33:01.1010432Z | ^^^^^^ 2025-05-07T20:33:01.1010638Z | **current, 2025-05-07T20:33:01.1010856Z | ^^^^^^^^^^ 2025-05-07T20:33:01.1011065Z | ) 2025-05-07T20:33:01.1011246Z | ^ 2025-05-07T20:33:01.1011739Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:33:01.1012309Z | kernel = self.compile( 2025-05-07T20:33:01.1012551Z | src, 2025-05-07T20:33:01.1012892Z | target=target, 2025-05-07T20:33:01.1013148Z | options=options.__dict__, 2025-05-07T20:33:01.1013412Z | ) 2025-05-07T20:33:01.1013966Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:33:01.1014669Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1015360Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:33:01.1016147Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1016617Z | module_map=module_map) 2025-05-07T20:33:01.1016972Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1017312Z | def _kernel_quantize_fp8_row( 2025-05-07T20:33:01.1017570Z | ^ 2025-05-07T20:33:01.1018027Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1018673Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:01.1019063Z | # The test always failed when commented parts were varied together. 
2025-05-07T20:33:01.1019576Z | self=, 2025-05-07T20:33:01.1020000Z | T=1, # or any other generated value 2025-05-07T20:33:01.1020298Z | D=5120, # or any other generated value 2025-05-07T20:33:01.1020623Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:01.1020972Z | contiguous=True, # or any other generated value 2025-05-07T20:33:01.1021319Z | compiled=True, # or any other generated value 2025-05-07T20:33:01.1021615Z | ) 2025-05-07T20:33:01.1021880Z | 2025-05-07T20:33:01.1022395Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:01.1022988Z +------------------------------------ 2025-05-07T20:33:01.1023343Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:33:01.1023720Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1024120Z self=, 2025-05-07T20:33:01.1024511Z T=1, 2025-05-07T20:33:01.1024691Z D=5120, 2025-05-07T20:33:01.1024871Z scale_ub=None, 2025-05-07T20:33:01.1025078Z contiguous=True, 2025-05-07T20:33:01.1025292Z compiled=True, 2025-05-07T20:33:01.1025485Z ) 2025-05-07T20:33:01.1025797Z self = 2025-05-07T20:33:01.1026268Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:01.1026531Z 2025-05-07T20:33:01.1026612Z @given( 2025-05-07T20:33:01.1026826Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1027133Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1027528Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1027854Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1028173Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1028449Z ) 2025-05-07T20:33:01.1028788Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1029233Z def test_silu_mul_quant( 2025-05-07T20:33:01.1029470Z self, 2025-05-07T20:33:01.1029658Z T: int, 2025-05-07T20:33:01.1029843Z D: int, 2025-05-07T20:33:01.1030055Z scale_ub: Optional[float], 2025-05-07T20:33:01.1030322Z contiguous: bool, 2025-05-07T20:33:01.1030552Z compiled: bool, 2025-05-07T20:33:01.1030769Z ) -> None: 2025-05-07T20:33:01.1030980Z torch.manual_seed(2025) 2025-05-07T20:33:01.1031211Z 2025-05-07T20:33:01.1031478Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1031909Z 2025-05-07T20:33:01.1032094Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1032381Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1032679Z x = x_sign * x_clamp 2025-05-07T20:33:01.1032903Z x0 = x[:, :D] 2025-05-07T20:33:01.1033109Z x1 = x[:, D:] 2025-05-07T20:33:01.1033305Z 2025-05-07T20:33:01.1033481Z if contiguous: 2025-05-07T20:33:01.1033699Z x0 = x0.contiguous() 2025-05-07T20:33:01.1033947Z x1 = x1.contiguous() 2025-05-07T20:33:01.1034177Z 2025-05-07T20:33:01.1034353Z if scale_ub is not None: 2025-05-07T20:33:01.1034620Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1034948Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1035241Z ) 2025-05-07T20:33:01.1035427Z else: 2025-05-07T20:33:01.1035629Z scale_ub_tensor = None 2025-05-07T20:33:01.1035863Z 2025-05-07T20:33:01.1036095Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1036445Z op = silu_mul_quant 2025-05-07T20:33:01.1036681Z if compiled: 2025-05-07T20:33:01.1036919Z op = torch.compile(op) 2025-05-07T20:33:01.1037205Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1037462Z 2025-05-07T20:33:01.1037646Z 
y_fp8, y_scale = fn() 2025-05-07T20:33:01.1037921Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:01.1038201Z 2025-05-07T20:33:01.1038421Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1038942Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:01.1039228Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:01.1039524Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:01.1039927Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.1040525Z 2025-05-07T20:33:01.1040798Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:01.1041067Z 2025-05-07T20:33:01.1041197Z moe/activation_test.py:126: 2025-05-07T20:33:01.1041575Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1042001Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:01.1042364Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.1043138Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:01.1043877Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:01.1044411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1045085Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1045791Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:01.1046509Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:01.1047241Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:01.1047871Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:01.1048473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:01.1048981Z fn() 2025-05-07T20:33:01.1049495Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:01.1050090Z self.fn.run( 2025-05-07T20:33:01.1050555Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1051070Z kernel = self.compile( 2025-05-07T20:33:01.1051826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1052471Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1052858Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1053080Z 2025-05-07T20:33:01.1053281Z self = 2025-05-07T20:33:01.1054340Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1055713Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f37b445b6a0>} 2025-05-07T20:33:01.1057037Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1058140Z context = 2025-05-07T20:33:01.1058433Z 2025-05-07T20:33:01.1058601Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1059122Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1059580Z module_map=module_map) 2025-05-07T20:33:01.1059931Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1060281Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:01.1060538Z E ^ 2025-05-07T20:33:01.1060985Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1061510Z 2025-05-07T20:33:01.1061933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1062439Z 2025-05-07T20:33:01.1062539Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1062938Z self=, 2025-05-07T20:33:01.1063320Z T=2048, 2025-05-07T20:33:01.1063509Z D=5120, 2025-05-07T20:33:01.1063694Z scale_ub=1200.0, 2025-05-07T20:33:01.1063905Z contiguous=True, 2025-05-07T20:33:01.1064119Z compiled=False, 2025-05-07T20:33:01.1064318Z ) 2025-05-07T20:33:01.1064622Z self = 2025-05-07T20:33:01.1065102Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:01.1065371Z 2025-05-07T20:33:01.1065444Z @given( 2025-05-07T20:33:01.1065664Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1065967Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1066265Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1066591Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1066908Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1067187Z ) 2025-05-07T20:33:01.1067655Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1068077Z def test_silu_mul_quant( 2025-05-07T20:33:01.1068314Z self, 2025-05-07T20:33:01.1068503Z T: int, 2025-05-07T20:33:01.1068688Z D: int, 2025-05-07T20:33:01.1068902Z scale_ub: Optional[float], 2025-05-07T20:33:01.1069167Z contiguous: bool, 2025-05-07T20:33:01.1069404Z compiled: bool, 2025-05-07T20:33:01.1069613Z ) -> None: 2025-05-07T20:33:01.1069820Z torch.manual_seed(2025) 2025-05-07T20:33:01.1070058Z 2025-05-07T20:33:01.1070321Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1070658Z 2025-05-07T20:33:01.1070844Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1071210Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1071520Z x = x_sign * x_clamp 2025-05-07T20:33:01.1071752Z x0 = x[:, :D] 2025-05-07T20:33:01.1071954Z x1 = x[:, D:] 2025-05-07T20:33:01.1072154Z 2025-05-07T20:33:01.1072329Z if contiguous: 2025-05-07T20:33:01.1072551Z x0 = x0.contiguous() 2025-05-07T20:33:01.1072798Z x1 = x1.contiguous() 2025-05-07T20:33:01.1073027Z 2025-05-07T20:33:01.1073206Z if scale_ub is not None: 2025-05-07T20:33:01.1073468Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1073793Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1074091Z ) 2025-05-07T20:33:01.1074268Z else: 2025-05-07T20:33:01.1074471Z scale_ub_tensor = None 2025-05-07T20:33:01.1074718Z 2025-05-07T20:33:01.1074933Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1075234Z op = silu_mul_quant 2025-05-07T20:33:01.1075481Z if compiled: 
2025-05-07T20:33:01.1075713Z op = torch.compile(op) 2025-05-07T20:33:01.1076047Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1076311Z 2025-05-07T20:33:01.1076495Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1076660Z 2025-05-07T20:33:01.1076754Z moe/activation_test.py:117: 2025-05-07T20:33:01.1077038Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1077361Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1077632Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1078332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.1079032Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1079599Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1080272Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1080944Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1081478Z kernel = self.compile( 2025-05-07T20:33:01.1082048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1082687Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1083074Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1083296Z 2025-05-07T20:33:01.1083497Z self = 2025-05-07T20:33:01.1084556Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1085910Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37b40c1f80>} 2025-05-07T20:33:01.1087233Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1088236Z context = 2025-05-07T20:33:01.1088516Z 2025-05-07T20:33:01.1088676Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1089196Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1089662Z module_map=module_map) 2025-05-07T20:33:01.1090023Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1090368Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1090738Z E ^ 2025-05-07T20:33:01.1091200Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1091650Z 2025-05-07T20:33:01.1092076Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1092582Z 2025-05-07T20:33:01.1092679Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1093081Z self=, 2025-05-07T20:33:01.1093480Z T=2048, 2025-05-07T20:33:01.1093657Z D=5120, 2025-05-07T20:33:01.1093840Z scale_ub=1200.0, 2025-05-07T20:33:01.1094051Z contiguous=True, 2025-05-07T20:33:01.1094257Z compiled=True, 2025-05-07T20:33:01.1094450Z ) 2025-05-07T20:33:01.1094763Z self = 2025-05-07T20:33:01.1095242Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:01.1095517Z 2025-05-07T20:33:01.1095639Z @given( 2025-05-07T20:33:01.1095862Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1096161Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1096459Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1096781Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1097095Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1097367Z ) 2025-05-07T20:33:01.1097707Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1098153Z def test_silu_mul_quant( 2025-05-07T20:33:01.1098386Z self, 2025-05-07T20:33:01.1098574Z T: int, 2025-05-07T20:33:01.1098762Z D: int, 2025-05-07T20:33:01.1099016Z scale_ub: Optional[float], 2025-05-07T20:33:01.1099281Z contiguous: bool, 2025-05-07T20:33:01.1099514Z compiled: bool, 2025-05-07T20:33:01.1099725Z ) -> None: 2025-05-07T20:33:01.1099947Z torch.manual_seed(2025) 2025-05-07T20:33:01.1100188Z 2025-05-07T20:33:01.1100446Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1100780Z 2025-05-07T20:33:01.1100966Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1101255Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1101554Z x = x_sign * x_clamp 2025-05-07T20:33:01.1101790Z x0 = x[:, :D] 2025-05-07T20:33:01.1102005Z x1 = x[:, D:] 2025-05-07T20:33:01.1102302Z 2025-05-07T20:33:01.1102566Z if contiguous: 2025-05-07T20:33:01.1102868Z x0 = x0.contiguous() 2025-05-07T20:33:01.1103421Z x1 = x1.contiguous() 2025-05-07T20:33:01.1103738Z 2025-05-07T20:33:01.1103981Z if scale_ub is not None: 2025-05-07T20:33:01.1120006Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1120474Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1120899Z ) 2025-05-07T20:33:01.1121098Z else: 2025-05-07T20:33:01.1121309Z scale_ub_tensor = None 2025-05-07T20:33:01.1121558Z 2025-05-07T20:33:01.1121782Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1122093Z op = silu_mul_quant 2025-05-07T20:33:01.1122338Z if compiled: 2025-05-07T20:33:01.1122577Z op = torch.compile(op) 2025-05-07T20:33:01.1122866Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1123133Z 2025-05-07T20:33:01.1123313Z y_fp8, y_scale = fn() 2025-05-07T20:33:01.1123589Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:01.1123866Z 2025-05-07T20:33:01.1124088Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1124416Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:01.1124701Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:01.1125196Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:01.1125547Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.1125848Z 2025-05-07T20:33:01.1126040Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:01.1126229Z 2025-05-07T20:33:01.1126325Z moe/activation_test.py:126: 2025-05-07T20:33:01.1126617Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1126950Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:01.1127263Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.1128038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:01.1128770Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:01.1129304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1129988Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1130740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:01.1131454Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:01.1132178Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:01.1132799Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:01.1133402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:01.1133904Z fn() 2025-05-07T20:33:01.1134409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:01.1135052Z self.fn.run( 2025-05-07T20:33:01.1135515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1136034Z kernel = self.compile( 2025-05-07T20:33:01.1136582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1137350Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1137740Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1137962Z 2025-05-07T20:33:01.1138165Z self = 2025-05-07T20:33:01.1139236Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1141131Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37b41191c0>} 2025-05-07T20:33:01.1142462Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1143477Z context = 2025-05-07T20:33:01.1143762Z 2025-05-07T20:33:01.1143924Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1144437Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1144900Z module_map=module_map) 2025-05-07T20:33:01.1145260Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1145608Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:01.1145870Z E ^ 2025-05-07T20:33:01.1146570Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1147036Z 2025-05-07T20:33:01.1147577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1148285Z 2025-05-07T20:33:01.1148420Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1148960Z self=, 2025-05-07T20:33:01.1149478Z T=16384, 2025-05-07T20:33:01.1149657Z D=7168, 2025-05-07T20:33:01.1149843Z scale_ub=1200.0, 2025-05-07T20:33:01.1150059Z contiguous=False, 2025-05-07T20:33:01.1150272Z compiled=False, 2025-05-07T20:33:01.1150473Z ) 2025-05-07T20:33:01.1150784Z self = 2025-05-07T20:33:01.1151264Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:01.1151555Z 2025-05-07T20:33:01.1151629Z @given( 2025-05-07T20:33:01.1151856Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1152153Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1152571Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1152892Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1153208Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1153476Z ) 2025-05-07T20:33:01.1153812Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1154238Z def test_silu_mul_quant( 2025-05-07T20:33:01.1154464Z self, 2025-05-07T20:33:01.1154653Z T: int, 2025-05-07T20:33:01.1154838Z D: int, 2025-05-07T20:33:01.1155042Z scale_ub: Optional[float], 2025-05-07T20:33:01.1155303Z contiguous: bool, 2025-05-07T20:33:01.1155534Z compiled: bool, 2025-05-07T20:33:01.1155820Z ) -> None: 2025-05-07T20:33:01.1156029Z torch.manual_seed(2025) 2025-05-07T20:33:01.1156261Z 2025-05-07T20:33:01.1156525Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1156858Z 2025-05-07T20:33:01.1157045Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1157320Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1157621Z x = x_sign * x_clamp 2025-05-07T20:33:01.1157852Z x0 = x[:, :D] 2025-05-07T20:33:01.1158056Z x1 = x[:, D:] 2025-05-07T20:33:01.1158249Z 2025-05-07T20:33:01.1158422Z if contiguous: 2025-05-07T20:33:01.1158641Z x0 = x0.contiguous() 2025-05-07T20:33:01.1158882Z x1 = x1.contiguous() 2025-05-07T20:33:01.1159114Z 2025-05-07T20:33:01.1159299Z if scale_ub is not None: 2025-05-07T20:33:01.1159560Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1159889Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1160191Z ) 2025-05-07T20:33:01.1160372Z else: 2025-05-07T20:33:01.1160575Z scale_ub_tensor = None 2025-05-07T20:33:01.1160825Z 2025-05-07T20:33:01.1161041Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1161346Z op = silu_mul_quant 2025-05-07T20:33:01.1161590Z if compiled: 2025-05-07T20:33:01.1161822Z op = torch.compile(op) 2025-05-07T20:33:01.1162110Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1162374Z 2025-05-07T20:33:01.1162555Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1162712Z 2025-05-07T20:33:01.1162803Z moe/activation_test.py:117: 2025-05-07T20:33:01.1163090Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1163409Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1163672Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1164370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:01.1165051Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1165660Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1166347Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1166999Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1167519Z kernel = self.compile( 2025-05-07T20:33:01.1168053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1168717Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1169105Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1169325Z 2025-05-07T20:33:01.1169536Z self = 2025-05-07T20:33:01.1170600Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1172033Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37b42aa980>} 2025-05-07T20:33:01.1173389Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1174391Z context = 2025-05-07T20:33:01.1174670Z 2025-05-07T20:33:01.1174835Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1175382Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1175847Z module_map=module_map) 2025-05-07T20:33:01.1176207Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1176543Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1176791Z E ^ 2025-05-07T20:33:01.1177248Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1177691Z 2025-05-07T20:33:01.1178113Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1178613Z 2025-05-07T20:33:01.1178709Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1179113Z self=, 2025-05-07T20:33:01.1179508Z T=1, 2025-05-07T20:33:01.1179679Z D=7168, 2025-05-07T20:33:01.1179872Z scale_ub=None, 2025-05-07T20:33:01.1180074Z contiguous=True, 2025-05-07T20:33:01.1180282Z compiled=True, 2025-05-07T20:33:01.1180477Z ) 2025-05-07T20:33:01.1180794Z self = 2025-05-07T20:33:01.1181267Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:01.1181518Z 2025-05-07T20:33:01.1181591Z @given( 2025-05-07T20:33:01.1181810Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1182112Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1182402Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1182723Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1183041Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1183311Z ) 2025-05-07T20:33:01.1183649Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1184088Z def test_silu_mul_quant( 2025-05-07T20:33:01.1184317Z self, 2025-05-07T20:33:01.1184499Z T: int, 2025-05-07T20:33:01.1184690Z D: int, 2025-05-07T20:33:01.1184993Z scale_ub: Optional[float], 2025-05-07T20:33:01.1185255Z contiguous: bool, 2025-05-07T20:33:01.1185487Z compiled: bool, 2025-05-07T20:33:01.1185699Z ) -> None: 2025-05-07T20:33:01.1185897Z torch.manual_seed(2025) 2025-05-07T20:33:01.1186132Z 2025-05-07T20:33:01.1186397Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1186719Z 2025-05-07T20:33:01.1186902Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1187181Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1187573Z x = x_sign * x_clamp 2025-05-07T20:33:01.1187807Z x0 = x[:, :D] 2025-05-07T20:33:01.1188022Z x1 = x[:, D:] 2025-05-07T20:33:01.1188217Z 2025-05-07T20:33:01.1188393Z if contiguous: 2025-05-07T20:33:01.1188620Z x0 = x0.contiguous() 2025-05-07T20:33:01.1188865Z x1 = x1.contiguous() 2025-05-07T20:33:01.1189098Z 2025-05-07T20:33:01.1189288Z if scale_ub is not None: 2025-05-07T20:33:01.1189546Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1189931Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1190225Z ) 2025-05-07T20:33:01.1190409Z else: 2025-05-07T20:33:01.1190605Z scale_ub_tensor = None 2025-05-07T20:33:01.1190852Z 2025-05-07T20:33:01.1191076Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1191376Z op = silu_mul_quant 2025-05-07T20:33:01.1191623Z if compiled: 2025-05-07T20:33:01.1191862Z op = torch.compile(op) 2025-05-07T20:33:01.1192144Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1192409Z 2025-05-07T20:33:01.1192604Z y_fp8, y_scale = fn() 2025-05-07T20:33:01.1192872Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:01.1193204Z 2025-05-07T20:33:01.1193438Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1193768Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:01.1194055Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:01.1194358Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:01.1194708Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.1195002Z 2025-05-07T20:33:01.1195195Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:01.1195383Z 2025-05-07T20:33:01.1195483Z moe/activation_test.py:126: 2025-05-07T20:33:01.1195768Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1196093Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:01.1196414Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.1197180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:01.1197923Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:01.1198482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1199155Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1199833Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:01.1200542Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:01.1201259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:01.1201886Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:01.1202474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:01.1202978Z fn() 2025-05-07T20:33:01.1203582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:01.1204168Z self.fn.run( 2025-05-07T20:33:01.1204628Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1205150Z kernel = self.compile( 2025-05-07T20:33:01.1205686Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1206316Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1206701Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1206921Z 2025-05-07T20:33:01.1207128Z self = 2025-05-07T20:33:01.1208194Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1209536Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37aec76520>} 2025-05-07T20:33:01.1210899Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1211903Z context = 2025-05-07T20:33:01.1212184Z 2025-05-07T20:33:01.1212351Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1212852Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1213351Z module_map=module_map) 2025-05-07T20:33:01.1213705Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1214059Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:01.1214313Z E ^ 2025-05-07T20:33:01.1214774Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1215217Z 2025-05-07T20:33:01.1215656Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1216154Z 2025-05-07T20:33:01.1216257Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1216650Z self=, 2025-05-07T20:33:01.1217040Z T=4096, 2025-05-07T20:33:01.1217227Z D=5120, 2025-05-07T20:33:01.1217407Z scale_ub=None, 2025-05-07T20:33:01.1217624Z contiguous=False, 2025-05-07T20:33:01.1217846Z compiled=False, 2025-05-07T20:33:01.1218044Z ) 2025-05-07T20:33:01.1218356Z self = 2025-05-07T20:33:01.1218845Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:01.1219116Z 2025-05-07T20:33:01.1219189Z @given( 2025-05-07T20:33:01.1219414Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1219719Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1220023Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1220337Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1220657Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1220933Z ) 2025-05-07T20:33:01.1221265Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1221702Z def test_silu_mul_quant( 2025-05-07T20:33:01.1221943Z self, 2025-05-07T20:33:01.1222129Z T: int, 2025-05-07T20:33:01.1222320Z D: int, 2025-05-07T20:33:01.1222533Z scale_ub: Optional[float], 2025-05-07T20:33:01.1222795Z contiguous: bool, 2025-05-07T20:33:01.1223028Z compiled: bool, 2025-05-07T20:33:01.1223324Z ) -> None: 2025-05-07T20:33:01.1223529Z torch.manual_seed(2025) 2025-05-07T20:33:01.1223762Z 2025-05-07T20:33:01.1224027Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1224363Z 2025-05-07T20:33:01.1224542Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1224826Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1225129Z x = x_sign * x_clamp 2025-05-07T20:33:01.1225354Z x0 = x[:, :D] 2025-05-07T20:33:01.1225560Z x1 = x[:, D:] 2025-05-07T20:33:01.1225762Z 2025-05-07T20:33:01.1225932Z if contiguous: 2025-05-07T20:33:01.1226152Z x0 = x0.contiguous() 2025-05-07T20:33:01.1226394Z x1 = x1.contiguous() 2025-05-07T20:33:01.1226617Z 2025-05-07T20:33:01.1226800Z if scale_ub is not None: 2025-05-07T20:33:01.1227062Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1227384Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1227761Z ) 2025-05-07T20:33:01.1227993Z else: 2025-05-07T20:33:01.1228188Z scale_ub_tensor = None 2025-05-07T20:33:01.1228431Z 2025-05-07T20:33:01.1228650Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1228952Z op = silu_mul_quant 2025-05-07T20:33:01.1229194Z if compiled: 2025-05-07T20:33:01.1229437Z op = torch.compile(op) 2025-05-07T20:33:01.1229722Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1229984Z 2025-05-07T20:33:01.1230170Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1230327Z 2025-05-07T20:33:01.1230427Z moe/activation_test.py:117: 2025-05-07T20:33:01.1230704Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1230857Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1230950Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1231468Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.1231571Z 
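The ref_fn bodies shown in the examples above spell out the math the fused kernel implements: SiLU(x0) * x1 in fp32, followed by rowwise FP8 quantization. The activation part in isolation, as a standalone eager sketch (function name hypothetical):

import torch

def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
    # SiLU(x0) * x1 computed in fp32, exactly as the test's ref_fn does
    # before handing the result to triton_quantize_fp8_row.
    x0_fp32 = x0.to(torch.float32)
    x1_fp32 = x1.to(torch.float32)
    return x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32

torch.nn.functional.silu(x0_fp32) * x1_fp32 computes the same thing, since SiLU(x) = x * sigmoid(x).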
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1231935Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1232161Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1232496Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1232586Z kernel = self.compile( 2025-05-07T20:33:01.1232990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1233162Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1233289Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1233300Z 2025-05-07T20:33:01.1233504Z self = 2025-05-07T20:33:01.1234272Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1234807Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37aec77f60>} 2025-05-07T20:33:01.1235540Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1235735Z context = 2025-05-07T20:33:01.1235739Z 2025-05-07T20:33:01.1235998Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1236262Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1236374Z module_map=module_map) 2025-05-07T20:33:01.1236530Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1236631Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1236704Z E ^ 2025-05-07T20:33:01.1237052Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1237057Z 2025-05-07T20:33:01.1237487Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1237492Z 2025-05-07T20:33:01.1237590Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1237819Z self=, 2025-05-07T20:33:01.1237889Z T=4096, 2025-05-07T20:33:01.1237963Z D=7168, 2025-05-07T20:33:01.1238052Z scale_ub=None, 2025-05-07T20:33:01.1238203Z contiguous=False, 2025-05-07T20:33:01.1238278Z compiled=False, 2025-05-07T20:33:01.1238356Z ) 2025-05-07T20:33:01.1238576Z self = 2025-05-07T20:33:01.1238752Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:01.1238756Z 2025-05-07T20:33:01.1238836Z @given( 2025-05-07T20:33:01.1238948Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1239043Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1239163Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1239275Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1239395Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1239511Z ) 2025-05-07T20:33:01.1239791Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1239932Z def test_silu_mul_quant( 2025-05-07T20:33:01.1240038Z self, 2025-05-07T20:33:01.1240358Z T: int, 2025-05-07T20:33:01.1240478Z D: int, 2025-05-07T20:33:01.1240629Z scale_ub: Optional[float], 2025-05-07T20:33:01.1240755Z contiguous: bool, 2025-05-07T20:33:01.1240899Z compiled: bool, 2025-05-07T20:33:01.1241010Z ) -> None: 2025-05-07T20:33:01.1241135Z torch.manual_seed(2025) 2025-05-07T20:33:01.1241232Z 2025-05-07T20:33:01.1241401Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1241482Z 2025-05-07T20:33:01.1241570Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1241690Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1241779Z x = x_sign * x_clamp 2025-05-07T20:33:01.1241858Z x0 = x[:, :D] 2025-05-07T20:33:01.1241935Z x1 = x[:, D:] 2025-05-07T20:33:01.1242009Z 2025-05-07T20:33:01.1242087Z if contiguous: 2025-05-07T20:33:01.1242178Z x0 = x0.contiguous() 2025-05-07T20:33:01.1242272Z x1 = x1.contiguous() 2025-05-07T20:33:01.1242340Z 2025-05-07T20:33:01.1242424Z if scale_ub is not None: 2025-05-07T20:33:01.1242530Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1242662Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1242739Z ) 2025-05-07T20:33:01.1242813Z else: 2025-05-07T20:33:01.1242902Z scale_ub_tensor = None 2025-05-07T20:33:01.1242974Z 2025-05-07T20:33:01.1243101Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1243191Z op = silu_mul_quant 2025-05-07T20:33:01.1243278Z if compiled: 2025-05-07T20:33:01.1243372Z op = torch.compile(op) 2025-05-07T20:33:01.1243476Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1243550Z 2025-05-07T20:33:01.1243635Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1243640Z 2025-05-07T20:33:01.1243921Z moe/activation_test.py:117: 2025-05-07T20:33:01.1244058Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1244155Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1244256Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1244761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.1244853Z 
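For context, rowwise FP8 quantization of the kind triton_quantize_fp8_row performs derives one scale per row from that row's absolute max, so that y ~= y_fp8.to(torch.float32) * y_scale[:, None], which is exactly the check the test applies. A hedged eager sketch (not FBGEMM's kernel; scale_ub is treated here as an upper bound on the per-row max, which is how the test's 1200.0 appears to be used):

import torch
from typing import Optional, Tuple

def quantize_fp8_row_sketch(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
    row_max = y.abs().amax(dim=-1).to(torch.float32)
    if scale_ub is not None:
        row_max = torch.clamp(row_max, max=scale_ub.item())
    scale = torch.clamp(row_max, min=1e-12) / FP8_MAX  # dequant scale per row
    y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale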
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1245230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1245453Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1245801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1245893Z kernel = self.compile( 2025-05-07T20:33:01.1246294Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1246538Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1246662Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1246666Z 2025-05-07T20:33:01.1246866Z self = 2025-05-07T20:33:01.1247639Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1248139Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37aec76ca0>} 2025-05-07T20:33:01.1248949Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1249138Z context = 2025-05-07T20:33:01.1249143Z 2025-05-07T20:33:01.1249309Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1249573Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1249677Z module_map=module_map) 2025-05-07T20:33:01.1249841Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1249936Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1250008Z E ^ 2025-05-07T20:33:01.1250372Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1250380Z 2025-05-07T20:33:01.1250814Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1250825Z 2025-05-07T20:33:01.1250968Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1251267Z self=, 2025-05-07T20:33:01.1251368Z T=128, 2025-05-07T20:33:01.1251472Z D=7168, 2025-05-07T20:33:01.1251575Z scale_ub=None, 2025-05-07T20:33:01.1251682Z contiguous=False, 2025-05-07T20:33:01.1251796Z compiled=True, 2025-05-07T20:33:01.1251889Z ) 2025-05-07T20:33:01.1252968Z self = 2025-05-07T20:33:01.1253138Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:01.1253142Z 2025-05-07T20:33:01.1253218Z @given( 2025-05-07T20:33:01.1253338Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1253440Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1253654Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1253775Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1253888Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1253967Z ) 2025-05-07T20:33:01.1254215Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1254304Z def test_silu_mul_quant( 2025-05-07T20:33:01.1254383Z self, 2025-05-07T20:33:01.1254458Z T: int, 2025-05-07T20:33:01.1254530Z D: int, 2025-05-07T20:33:01.1254631Z scale_ub: Optional[float], 2025-05-07T20:33:01.1254716Z contiguous: bool, 2025-05-07T20:33:01.1254798Z compiled: bool, 2025-05-07T20:33:01.1254880Z ) -> None: 2025-05-07T20:33:01.1254969Z torch.manual_seed(2025) 2025-05-07T20:33:01.1255040Z 2025-05-07T20:33:01.1255218Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1255288Z 2025-05-07T20:33:01.1255378Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1255510Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1255643Z x = x_sign * x_clamp 2025-05-07T20:33:01.1255725Z x0 = x[:, :D] 2025-05-07T20:33:01.1255800Z x1 = x[:, D:] 2025-05-07T20:33:01.1255867Z 2025-05-07T20:33:01.1255951Z if contiguous: 2025-05-07T20:33:01.1256038Z x0 = x0.contiguous() 2025-05-07T20:33:01.1256122Z x1 = x1.contiguous() 2025-05-07T20:33:01.1256197Z 2025-05-07T20:33:01.1256282Z if scale_ub is not None: 2025-05-07T20:33:01.1256382Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1256520Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1256595Z ) 2025-05-07T20:33:01.1256665Z else: 2025-05-07T20:33:01.1256761Z scale_ub_tensor = None 2025-05-07T20:33:01.1256876Z 2025-05-07T20:33:01.1257007Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1257099Z op = silu_mul_quant 2025-05-07T20:33:01.1257182Z if compiled: 2025-05-07T20:33:01.1257286Z op = torch.compile(op) 2025-05-07T20:33:01.1257389Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1257457Z 2025-05-07T20:33:01.1257549Z y_fp8, y_scale = fn() 2025-05-07T20:33:01.1257666Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:01.1257737Z 2025-05-07T20:33:01.1257874Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1257970Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:01.1258065Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:01.1264953Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:01.1265121Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.1265208Z 2025-05-07T20:33:01.1265308Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:01.1265314Z 2025-05-07T20:33:01.1265414Z moe/activation_test.py:126: 2025-05-07T20:33:01.1265555Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1265660Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:01.1265792Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.1266372Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:01.1266471Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:01.1266843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1267060Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1267544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:01.1267920Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:01.1268315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:01.1268480Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:01.1268824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:01.1268898Z fn() 2025-05-07T20:33:01.1269320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:01.1269400Z self.fn.run( 2025-05-07T20:33:01.1269733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1269830Z kernel = self.compile( 2025-05-07T20:33:01.1270212Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1270394Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1270562Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1270567Z 2025-05-07T20:33:01.1270767Z self = 2025-05-07T20:33:01.1271573Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1272091Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37ae368180>} 2025-05-07T20:33:01.1272824Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1273057Z context = 2025-05-07T20:33:01.1273064Z 2025-05-07T20:33:01.1273225Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1273487Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1273594Z module_map=module_map) 2025-05-07T20:33:01.1273757Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1273854Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:01.1273928Z E ^ 2025-05-07T20:33:01.1274292Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1274297Z 2025-05-07T20:33:01.1274710Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1274718Z 2025-05-07T20:33:01.1274821Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1275043Z self=, 2025-05-07T20:33:01.1275119Z T=128, 2025-05-07T20:33:01.1275194Z D=7168, 2025-05-07T20:33:01.1275270Z scale_ub=None, 2025-05-07T20:33:01.1275357Z contiguous=False, 2025-05-07T20:33:01.1275443Z compiled=False, 2025-05-07T20:33:01.1275511Z ) 2025-05-07T20:33:01.1275725Z self = 2025-05-07T20:33:01.1275897Z T = 128, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:01.1275901Z 2025-05-07T20:33:01.1275976Z @given( 2025-05-07T20:33:01.1276089Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1276193Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1276303Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1276422Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1276632Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1276706Z ) 2025-05-07T20:33:01.1276957Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1277046Z def test_silu_mul_quant( 2025-05-07T20:33:01.1277120Z self, 2025-05-07T20:33:01.1277199Z T: int, 2025-05-07T20:33:01.1277270Z D: int, 2025-05-07T20:33:01.1277361Z scale_ub: Optional[float], 2025-05-07T20:33:01.1277451Z contiguous: bool, 2025-05-07T20:33:01.1277532Z compiled: bool, 2025-05-07T20:33:01.1277612Z ) -> None: 2025-05-07T20:33:01.1277701Z torch.manual_seed(2025) 2025-05-07T20:33:01.1277769Z 2025-05-07T20:33:01.1277938Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1278007Z 2025-05-07T20:33:01.1278093Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1278220Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1278303Z x = x_sign * x_clamp 2025-05-07T20:33:01.1278385Z x0 = x[:, :D] 2025-05-07T20:33:01.1278507Z x1 = x[:, D:] 2025-05-07T20:33:01.1278577Z 2025-05-07T20:33:01.1278656Z if contiguous: 2025-05-07T20:33:01.1278750Z x0 = x0.contiguous() 2025-05-07T20:33:01.1278832Z x1 = x1.contiguous() 2025-05-07T20:33:01.1278899Z 2025-05-07T20:33:01.1278992Z if scale_ub is not None: 2025-05-07T20:33:01.1279095Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1279232Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1279306Z ) 2025-05-07T20:33:01.1279383Z else: 2025-05-07T20:33:01.1279479Z scale_ub_tensor = None 2025-05-07T20:33:01.1279549Z 2025-05-07T20:33:01.1279675Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1279806Z op = silu_mul_quant 2025-05-07T20:33:01.1279888Z if compiled: 2025-05-07T20:33:01.1279985Z op = torch.compile(op) 2025-05-07T20:33:01.1280100Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1280174Z 2025-05-07T20:33:01.1280261Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1280274Z 2025-05-07T20:33:01.1280368Z moe/activation_test.py:117: 2025-05-07T20:33:01.1280493Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1280597Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1280692Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1281232Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.1281369Z 
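Hypothesis draws each (T, D, scale_ub, contiguous, compiled) example from the sampled_from strategies in the @given decorator, capped at _MAX_SAMPLES; the failures cover both layouts and both compile modes, so the error is architecture-dependent, not data-dependent. The whole strategy space is a small cartesian product, enumerable with a plain loop (illustration only):

from itertools import product

# 5 * 2 * 2 * 2 * 2 = 80 combinations in total.
GRID = list(product(
    [1, 128, 2048, 4096, 16384],  # T
    [5120, 7168],                 # D
    [None, 1200.0],               # scale_ub
    [True, False],                # contiguous
    [True, False],                # compiled
))
for T, D, scale_ub, contiguous, compiled in GRID:
    print(T, D, scale_ub, contiguous, compiled)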
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1281851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1282199Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1282702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1282799Z kernel = self.compile( 2025-05-07T20:33:01.1283205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1283374Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1283498Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1283503Z 2025-05-07T20:33:01.1283712Z self = 2025-05-07T20:33:01.1284476Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1285105Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37ae36b100>} 2025-05-07T20:33:01.1285839Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1286028Z context = 2025-05-07T20:33:01.1286033Z 2025-05-07T20:33:01.1286191Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1286447Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1286558Z module_map=module_map) 2025-05-07T20:33:01.1286715Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1286811Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1286886Z E ^ 2025-05-07T20:33:01.1287238Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1287283Z 2025-05-07T20:33:01.1287704Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1287709Z 2025-05-07T20:33:01.1287805Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1288024Z self=, 2025-05-07T20:33:01.1288105Z T=4096, 2025-05-07T20:33:01.1288181Z D=5120, 2025-05-07T20:33:01.1288262Z scale_ub=1200.0, 2025-05-07T20:33:01.1288346Z contiguous=True, 2025-05-07T20:33:01.1288426Z compiled=False, 2025-05-07T20:33:01.1288502Z ) 2025-05-07T20:33:01.1288720Z self = 2025-05-07T20:33:01.1288934Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:01.1288939Z 2025-05-07T20:33:01.1289017Z @given( 2025-05-07T20:33:01.1289136Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1289232Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1289349Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1289460Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1289567Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1289643Z ) 2025-05-07T20:33:01.1289883Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1289976Z def test_silu_mul_quant( 2025-05-07T20:33:01.1290047Z self, 2025-05-07T20:33:01.1290119Z T: int, 2025-05-07T20:33:01.1290195Z D: int, 2025-05-07T20:33:01.1290288Z scale_ub: Optional[float], 2025-05-07T20:33:01.1290374Z contiguous: bool, 2025-05-07T20:33:01.1290467Z compiled: bool, 2025-05-07T20:33:01.1290544Z ) -> None: 2025-05-07T20:33:01.1290630Z torch.manual_seed(2025) 2025-05-07T20:33:01.1290705Z 2025-05-07T20:33:01.1290874Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1290944Z 2025-05-07T20:33:01.1291039Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1291158Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1291248Z x = x_sign * x_clamp 2025-05-07T20:33:01.1291322Z x0 = x[:, :D] 2025-05-07T20:33:01.1291396Z x1 = x[:, D:] 2025-05-07T20:33:01.1291466Z 2025-05-07T20:33:01.1291544Z if contiguous: 2025-05-07T20:33:01.1291628Z x0 = x0.contiguous() 2025-05-07T20:33:01.1291716Z x1 = x1.contiguous() 2025-05-07T20:33:01.1291784Z 2025-05-07T20:33:01.1291866Z if scale_ub is not None: 2025-05-07T20:33:01.1291973Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1292105Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1292179Z ) 2025-05-07T20:33:01.1292254Z else: 2025-05-07T20:33:01.1292420Z scale_ub_tensor = None 2025-05-07T20:33:01.1292494Z 2025-05-07T20:33:01.1292622Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1292706Z op = silu_mul_quant 2025-05-07T20:33:01.1292794Z if compiled: 2025-05-07T20:33:01.1292889Z op = torch.compile(op) 2025-05-07T20:33:01.1292991Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1293070Z 2025-05-07T20:33:01.1293155Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1293159Z 2025-05-07T20:33:01.1293249Z moe/activation_test.py:117: 2025-05-07T20:33:01.1293379Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1293476Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1293573Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1294085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.1294182Z 
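Two failure shapes alternate through this run: with compiled=False, fn() dies inside silu_mul_quant's own Triton launch (activation_test.py:117), while with compiled=True, fn() survives torch.compile's lowering and the error instead fires from the eager Triton reference in ref_fn (activation_test.py:126). The toggle itself is just (a sketch of the test's own wrapping):

import torch

def maybe_compiled(op, compiled: bool):
    # Same toggle as the test's fn(): optionally wrap op in torch.compile.
    # Wrapping changes *where* the fp8e4nv error surfaces, not whether.
    return torch.compile(op) if compiled else op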
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1294601Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1294824Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1295168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1295260Z kernel = self.compile( 2025-05-07T20:33:01.1295655Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1295832Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1295955Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1296003Z 2025-05-07T20:33:01.1296201Z self = 2025-05-07T20:33:01.1296976Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1297488Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37ae1b1f80>} 2025-05-07T20:33:01.1298225Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1298410Z context = 2025-05-07T20:33:01.1298415Z 2025-05-07T20:33:01.1298575Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1298846Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1298954Z module_map=module_map) 2025-05-07T20:33:01.1299122Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1299215Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1299288Z E ^ 2025-05-07T20:33:01.1299658Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1299663Z 2025-05-07T20:33:01.1300088Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1300093Z 2025-05-07T20:33:01.1300198Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1300416Z self=, 2025-05-07T20:33:01.1300489Z T=1, 2025-05-07T20:33:01.1300573Z D=5120, 2025-05-07T20:33:01.1300654Z scale_ub=None, 2025-05-07T20:33:01.1300736Z contiguous=True, 2025-05-07T20:33:01.1300822Z compiled=True, 2025-05-07T20:33:01.1300969Z ) 2025-05-07T20:33:01.1301189Z self = 2025-05-07T20:33:01.1301351Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:01.1301356Z 2025-05-07T20:33:01.1301429Z @given( 2025-05-07T20:33:01.1301550Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1301643Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1301755Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1301873Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1301981Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1302048Z ) 2025-05-07T20:33:01.1302301Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1302397Z def test_silu_mul_quant( 2025-05-07T20:33:01.1302482Z self, 2025-05-07T20:33:01.1302568Z T: int, 2025-05-07T20:33:01.1302661Z D: int, 2025-05-07T20:33:01.1302771Z scale_ub: Optional[float], 2025-05-07T20:33:01.1303492Z contiguous: bool, 2025-05-07T20:33:01.1303574Z compiled: bool, 2025-05-07T20:33:01.1303655Z ) -> None: 2025-05-07T20:33:01.1303744Z torch.manual_seed(2025) 2025-05-07T20:33:01.1303812Z 2025-05-07T20:33:01.1303979Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1304048Z 2025-05-07T20:33:01.1304134Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1304261Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1304343Z x = x_sign * x_clamp 2025-05-07T20:33:01.1304416Z x0 = x[:, :D] 2025-05-07T20:33:01.1304499Z x1 = x[:, D:] 2025-05-07T20:33:01.1304565Z 2025-05-07T20:33:01.1304641Z if contiguous: 2025-05-07T20:33:01.1304801Z x0 = x0.contiguous() 2025-05-07T20:33:01.1304883Z x1 = x1.contiguous() 2025-05-07T20:33:01.1304956Z 2025-05-07T20:33:01.1305046Z if scale_ub is not None: 2025-05-07T20:33:01.1305149Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1305289Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1305363Z ) 2025-05-07T20:33:01.1305433Z else: 2025-05-07T20:33:01.1305527Z scale_ub_tensor = None 2025-05-07T20:33:01.1305596Z 2025-05-07T20:33:01.1305723Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1305811Z op = silu_mul_quant 2025-05-07T20:33:01.1305891Z if compiled: 2025-05-07T20:33:01.1305984Z op = torch.compile(op) 2025-05-07T20:33:01.1306090Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1306157Z 2025-05-07T20:33:01.1306250Z y_fp8, y_scale = fn() 2025-05-07T20:33:01.1306369Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:01.1306438Z 2025-05-07T20:33:01.1306575Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1306672Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:01.1306768Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:01.1306893Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:01.1307027Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.1307093Z 2025-05-07T20:33:01.1307192Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:01.1307197Z 2025-05-07T20:33:01.1307289Z moe/activation_test.py:126: 2025-05-07T20:33:01.1307494Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1307594Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:01.1307721Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.1308290Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:01.1308471Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:01.1308828Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1309052Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1309419Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:01.1309676Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:01.1310052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:01.1310216Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:01.1310558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:01.1310640Z fn() 2025-05-07T20:33:01.1311068Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:01.1311186Z self.fn.run( 2025-05-07T20:33:01.1311535Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1311630Z kernel = self.compile( 2025-05-07T20:33:01.1312006Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1312176Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1312309Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1312314Z 2025-05-07T20:33:01.1312513Z self = 2025-05-07T20:33:01.1313285Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1313851Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37ae35e520>} 2025-05-07T20:33:01.1314589Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1314774Z context = 2025-05-07T20:33:01.1314778Z 2025-05-07T20:33:01.1314935Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1315200Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1315304Z module_map=module_map) 2025-05-07T20:33:01.1315462Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1315570Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:01.1315644Z E ^ 2025-05-07T20:33:01.1315999Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1316004Z 2025-05-07T20:33:01.1316428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1316433Z 2025-05-07T20:33:01.1316530Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1316761Z self=, 2025-05-07T20:33:01.1316834Z T=2048, 2025-05-07T20:33:01.1316913Z D=5120, 2025-05-07T20:33:01.1316988Z scale_ub=None, 2025-05-07T20:33:01.1317069Z contiguous=True, 2025-05-07T20:33:01.1317152Z compiled=True, 2025-05-07T20:33:01.1317217Z ) 2025-05-07T20:33:01.1317433Z self = 2025-05-07T20:33:01.1317681Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:01.1317688Z 2025-05-07T20:33:01.1317759Z @given( 2025-05-07T20:33:01.1317872Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1317971Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1318081Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1318191Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1318305Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1318373Z ) 2025-05-07T20:33:01.1318625Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1318711Z def test_silu_mul_quant( 2025-05-07T20:33:01.1318780Z self, 2025-05-07T20:33:01.1318856Z T: int, 2025-05-07T20:33:01.1318931Z D: int, 2025-05-07T20:33:01.1319024Z scale_ub: Optional[float], 2025-05-07T20:33:01.1319117Z contiguous: bool, 2025-05-07T20:33:01.1319201Z compiled: bool, 2025-05-07T20:33:01.1319272Z ) -> None: 2025-05-07T20:33:01.1319407Z torch.manual_seed(2025) 2025-05-07T20:33:01.1319474Z 2025-05-07T20:33:01.1319635Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1319710Z 2025-05-07T20:33:01.1319794Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1319920Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1320000Z x = x_sign * x_clamp 2025-05-07T20:33:01.1320075Z x0 = x[:, :D] 2025-05-07T20:33:01.1320157Z x1 = x[:, D:] 2025-05-07T20:33:01.1320226Z 2025-05-07T20:33:01.1320304Z if contiguous: 2025-05-07T20:33:01.1320394Z x0 = x0.contiguous() 2025-05-07T20:33:01.1320481Z x1 = x1.contiguous() 2025-05-07T20:33:01.1320589Z 2025-05-07T20:33:01.1320679Z if scale_ub is not None: 2025-05-07T20:33:01.1320778Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1320913Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1320998Z ) 2025-05-07T20:33:01.1321066Z else: 2025-05-07T20:33:01.1321161Z scale_ub_tensor = None 2025-05-07T20:33:01.1321231Z 2025-05-07T20:33:01.1321354Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1321443Z op = silu_mul_quant 2025-05-07T20:33:01.1321523Z if compiled: 2025-05-07T20:33:01.1321616Z op = torch.compile(op) 2025-05-07T20:33:01.1321722Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1321791Z 2025-05-07T20:33:01.1321877Z y_fp8, y_scale = fn() 2025-05-07T20:33:01.1321998Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:01.1322067Z 2025-05-07T20:33:01.1322196Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1322301Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:01.1322395Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:01.1322521Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:01.1322658Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.1322725Z 2025-05-07T20:33:01.1322824Z > y_fp8_ref, 
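All of these launches enter through Triton's subscript syntax, kernel[grid](...), visible in the frames above as _fbgemm_silu_mul_quant[grid](...) and _kernel_quantize_fp8_row[grid](...): subscripting a @triton.jit function with a grid returns a launcher, and calling it triggers compile-on-first-use (the jit.py -> compiler.py path in each traceback). A self-contained sketch of the pattern with a trivial kernel (hypothetical, dtype-agnostic):

import torch
import triton
import triton.language as tl

@triton.jit
def _scale_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
    # Each program instance handles one BLOCK-sized slice of x.
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask, other=0.0)
    tl.store(y_ptr + offs, x * 2.0, mask=mask)

x = torch.randn(4096, device="cuda")
y = torch.empty_like(x)
n = x.numel()
grid = (triton.cdiv(n, 1024),)   # one program per 1024-element block
_scale_kernel[grid](x, y, n, BLOCK=1024)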
y_scale_ref = ref_fn() 2025-05-07T20:33:01.1322829Z 2025-05-07T20:33:01.1322919Z moe/activation_test.py:126: 2025-05-07T20:33:01.1323041Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1323147Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:01.1323274Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.1323839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:01.1323939Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:01.1324298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1324607Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1324987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:01.1325242Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:01.1325618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:01.1325779Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:01.1326133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:01.1326206Z fn() 2025-05-07T20:33:01.1326617Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:01.1326705Z self.fn.run( 2025-05-07T20:33:01.1327042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1327178Z kernel = self.compile( 2025-05-07T20:33:01.1327572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1327742Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1327873Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1327877Z 2025-05-07T20:33:01.1328077Z self = 2025-05-07T20:33:01.1328838Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1329394Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37ae10a840>} 2025-05-07T20:33:01.1330129Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1330324Z context = 2025-05-07T20:33:01.1330329Z 2025-05-07T20:33:01.1330486Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1330758Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1330861Z module_map=module_map) 2025-05-07T20:33:01.1331018Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1331124Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:01.1331193Z E ^ 2025-05-07T20:33:01.1331546Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1331559Z 2025-05-07T20:33:01.1331982Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1331987Z 2025-05-07T20:33:01.1332085Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1332307Z self=, 2025-05-07T20:33:01.1332381Z T=128, 2025-05-07T20:33:01.1332452Z D=5120, 2025-05-07T20:33:01.1332537Z scale_ub=None, 2025-05-07T20:33:01.1332616Z contiguous=True, 2025-05-07T20:33:01.1332690Z compiled=True, 2025-05-07T20:33:01.1332767Z ) 2025-05-07T20:33:01.1332989Z self = 2025-05-07T20:33:01.1333158Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:01.1333163Z 2025-05-07T20:33:01.1333234Z @given( 2025-05-07T20:33:01.1333426Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1333529Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1333638Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1333750Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1333864Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1333933Z ) 2025-05-07T20:33:01.1334176Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1334269Z def test_silu_mul_quant( 2025-05-07T20:33:01.1334340Z self, 2025-05-07T20:33:01.1334417Z T: int, 2025-05-07T20:33:01.1334488Z D: int, 2025-05-07T20:33:01.1334580Z scale_ub: Optional[float], 2025-05-07T20:33:01.1334672Z contiguous: bool, 2025-05-07T20:33:01.1334755Z compiled: bool, 2025-05-07T20:33:01.1334827Z ) -> None: 2025-05-07T20:33:01.1334922Z torch.manual_seed(2025) 2025-05-07T20:33:01.1334990Z 2025-05-07T20:33:01.1335159Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1335302Z 2025-05-07T20:33:01.1335391Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1335511Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1335598Z x = x_sign * x_clamp 2025-05-07T20:33:01.1335674Z x0 = x[:, :D] 2025-05-07T20:33:01.1335746Z x1 = x[:, D:] 2025-05-07T20:33:01.1335819Z 2025-05-07T20:33:01.1335896Z if contiguous: 2025-05-07T20:33:01.1335987Z x0 = x0.contiguous() 2025-05-07T20:33:01.1336070Z x1 = x1.contiguous() 2025-05-07T20:33:01.1336135Z 2025-05-07T20:33:01.1336222Z if scale_ub is not None: 2025-05-07T20:33:01.1336323Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1336495Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1336568Z ) 2025-05-07T20:33:01.1336637Z else: 2025-05-07T20:33:01.1336732Z scale_ub_tensor = None 2025-05-07T20:33:01.1336807Z 2025-05-07T20:33:01.1336930Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1337015Z op = silu_mul_quant 2025-05-07T20:33:01.1337100Z if compiled: 2025-05-07T20:33:01.1337193Z op = torch.compile(op) 2025-05-07T20:33:01.1337304Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1337372Z 2025-05-07T20:33:01.1337455Z y_fp8, y_scale = fn() 2025-05-07T20:33:01.1337577Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:01.1337642Z 2025-05-07T20:33:01.1337773Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1337874Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:01.1337967Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:01.1338085Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:01.1338229Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.1338295Z 2025-05-07T20:33:01.1338399Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:01.1338403Z 2025-05-07T20:33:01.1338496Z moe/activation_test.py:126: 2025-05-07T20:33:01.1338619Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1338722Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:01.1338851Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:01.1339422Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:01.1339522Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:01.1339891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1340387Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1341063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:01.1341327Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:01.1341713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:01.1341877Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:01.1342222Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:01.1342298Z fn() 2025-05-07T20:33:01.1342716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:01.1342802Z self.fn.run( 2025-05-07T20:33:01.1343138Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1343225Z kernel = self.compile( 2025-05-07T20:33:01.1343630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1343862Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1343991Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1343996Z 2025-05-07T20:33:01.1344197Z self = 2025-05-07T20:33:01.1344958Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1345473Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37b43aa160>} 2025-05-07T20:33:01.1346272Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1346467Z context = 2025-05-07T20:33:01.1346472Z 2025-05-07T20:33:01.1346630Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1346894Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1347004Z module_map=module_map) 2025-05-07T20:33:01.1347161Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1347267Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:01.1347339Z E ^ 2025-05-07T20:33:01.1347776Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

[Two duplicate Hypothesis traces elided. Both examples failed identically in ref_fn() at moe/activation_test.py:126, while compiling _kernel_quantize_fp8_row (reached via triton_quantize_fp8_row, fp8_gemm.py:2370):
  Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
  Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
Each raised triton.compiler.errors.CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") at triton/compiler/compiler.py:100.]
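[Note: fp8e4nv is Triton's name for the float8_e4m3fn format, which Triton compiles only for NVIDIA GPUs of compute capability 8.9 (Ada) or newer; on older architectures it exposes only fp8e4b15 and fp8e5, which matches the error above. This job runs on a g5.4xlarge runner whose A10G is SM 8.6, so every fp8e4nv kernel compile in this test is expected to fail. A minimal sketch of the check, using only public torch APIs:]

    import torch

    # A10G (g5.4xlarge) reports (8, 6); Triton's fp8e4nv needs >= (8, 9).
    major, minor = torch.cuda.get_device_capability()
    print(f"sm_{major}{minor} supports fp8e4nv: {(major, minor) >= (8, 9)}")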
[Two more duplicate traces elided, failing with the same CompilationError:
  Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True) -> failed in fn() at moe/activation_test.py:117, compiling _fbgemm_silu_mul_quant (reached via silu_mul_quant, gen_ai/moe/activation.py:80)
  Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True) -> failed in ref_fn() at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row
Both raised ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')").]
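[Note: one way to keep these tests green on pre-SM89 runners is to gate them on the capability check above. The decorator below is a hypothetical sketch, not FBGEMM's actual guard; unittest.skipUnless and torch.cuda.get_device_capability are real APIs, while the helper and decorator names are assumptions:]

    import unittest

    import torch

    def _cuda_supports_fp8e4nv() -> bool:
        # Triton lowers fp8e4nv (float8_e4m3fn) only on SM 8.9+ GPUs.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical decorator for tests such as test_silu_mul_quant.
    skip_if_no_fp8e4nv = unittest.skipUnless(
        _cuda_supports_fp8e4nv(),
        "Triton fp8e4nv requires SM 8.9+ (this runner's A10G is SM 8.6)",
    )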
[Two more duplicate traces elided, both failing in fn() at moe/activation_test.py:117 while compiling _fbgemm_silu_mul_quant; note the first example has compiled=False, so the failure is independent of torch.compile:
  Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
  Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
Both raised ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')").]
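[Note: a minimal repro sketch of the two failing entry points, assuming an fbgemm_gpu build with the experimental GenAI ops installed (import paths taken from the tracebacks above); on an SM 8.6 GPU both Triton kernels fail at their first launch with this CompilationError:]

    import torch

    from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import triton_quantize_fp8_row
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    x0 = torch.randn(128, 5120, device="cuda", dtype=torch.bfloat16)
    x1 = torch.randn_like(x0)

    # fn() path: compiles _fbgemm_silu_mul_quant (activation.py:80).
    silu_mul_quant(x0, x1, None)

    # ref_fn() path: compiles _kernel_quantize_fp8_row (fp8_gemm.py:2370).
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    triton_quantize_fp8_row(y, None)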
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1448245Z 2025-05-07T20:33:01.1448662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1448669Z 2025-05-07T20:33:01.1448774Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1448997Z self=, 2025-05-07T20:33:01.1449075Z T=128, 2025-05-07T20:33:01.1449157Z D=7168, 2025-05-07T20:33:01.1449235Z scale_ub=1200.0, 2025-05-07T20:33:01.1449329Z contiguous=False, 2025-05-07T20:33:01.1449410Z compiled=False, 2025-05-07T20:33:01.1449480Z ) 2025-05-07T20:33:01.1449702Z self = 2025-05-07T20:33:01.1449873Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:01.1449877Z 2025-05-07T20:33:01.1449951Z @given( 2025-05-07T20:33:01.1450073Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1450171Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1450280Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1450405Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1450517Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1450596Z ) 2025-05-07T20:33:01.1450835Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1450926Z def test_silu_mul_quant( 2025-05-07T20:33:01.1451003Z self, 2025-05-07T20:33:01.1451085Z T: int, 2025-05-07T20:33:01.1451160Z D: int, 2025-05-07T20:33:01.1451263Z scale_ub: Optional[float], 2025-05-07T20:33:01.1451348Z contiguous: bool, 2025-05-07T20:33:01.1451431Z compiled: bool, 2025-05-07T20:33:01.1451516Z ) -> None: 2025-05-07T20:33:01.1451608Z torch.manual_seed(2025) 2025-05-07T20:33:01.1451675Z 2025-05-07T20:33:01.1451845Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1451913Z 2025-05-07T20:33:01.1452005Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1452208Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1452297Z x = x_sign * x_clamp 2025-05-07T20:33:01.1452376Z x0 = x[:, :D] 2025-05-07T20:33:01.1452454Z x1 = x[:, D:] 2025-05-07T20:33:01.1452518Z 2025-05-07T20:33:01.1452600Z if contiguous: 2025-05-07T20:33:01.1452686Z x0 = x0.contiguous() 2025-05-07T20:33:01.1452770Z x1 = x1.contiguous() 2025-05-07T20:33:01.1452847Z 2025-05-07T20:33:01.1452930Z if scale_ub is not None: 2025-05-07T20:33:01.1453029Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1453164Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1453231Z ) 2025-05-07T20:33:01.1453307Z else: 2025-05-07T20:33:01.1453395Z scale_ub_tensor = None 2025-05-07T20:33:01.1453462Z 2025-05-07T20:33:01.1453593Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1453678Z op = silu_mul_quant 2025-05-07T20:33:01.1453763Z if compiled: 2025-05-07T20:33:01.1453906Z op = torch.compile(op) 2025-05-07T20:33:01.1454006Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1454072Z 2025-05-07T20:33:01.1454163Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1454168Z 2025-05-07T20:33:01.1454258Z moe/activation_test.py:117: 2025-05-07T20:33:01.1454381Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1454491Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1454585Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1455086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.1455180Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1455576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1455806Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1456144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1456238Z kernel = self.compile( 2025-05-07T20:33:01.1456688Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1456919Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1457088Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1457098Z 2025-05-07T20:33:01.1457353Z self = 2025-05-07T20:33:01.1458280Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1458786Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3789c0ec00>} 2025-05-07T20:33:01.1459516Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1459705Z context = 2025-05-07T20:33:01.1459709Z 2025-05-07T20:33:01.1459866Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1460124Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1460228Z module_map=module_map) 2025-05-07T20:33:01.1460384Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1460587Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1460661Z E ^ 2025-05-07T20:33:01.1461011Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1461016Z 2025-05-07T20:33:01.1461430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1461435Z 2025-05-07T20:33:01.1461532Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1461753Z self=, 2025-05-07T20:33:01.1461825Z T=128, 2025-05-07T20:33:01.1461900Z D=5120, 2025-05-07T20:33:01.1461981Z scale_ub=None, 2025-05-07T20:33:01.1462062Z contiguous=False, 2025-05-07T20:33:01.1462141Z compiled=False, 2025-05-07T20:33:01.1462219Z ) 2025-05-07T20:33:01.1462431Z self = 2025-05-07T20:33:01.1462607Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:01.1462652Z 2025-05-07T20:33:01.1462724Z @given( 2025-05-07T20:33:01.1462841Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1462941Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1463058Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1463170Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1463283Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1463349Z ) 2025-05-07T20:33:01.1463584Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1463675Z def test_silu_mul_quant( 2025-05-07T20:33:01.1463748Z self, 2025-05-07T20:33:01.1463824Z T: int, 2025-05-07T20:33:01.1463897Z D: int, 2025-05-07T20:33:01.1464052Z scale_ub: Optional[float], 2025-05-07T20:33:01.1464145Z contiguous: bool, 2025-05-07T20:33:01.1464226Z compiled: bool, 2025-05-07T20:33:01.1464304Z ) -> None: 2025-05-07T20:33:01.1464406Z torch.manual_seed(2025) 2025-05-07T20:33:01.1464477Z 2025-05-07T20:33:01.1464639Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1464713Z 2025-05-07T20:33:01.1464802Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1464921Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1465012Z x = x_sign * x_clamp 2025-05-07T20:33:01.1465090Z x0 = x[:, :D] 2025-05-07T20:33:01.1465165Z x1 = x[:, D:] 2025-05-07T20:33:01.1465242Z 2025-05-07T20:33:01.1465322Z if contiguous: 2025-05-07T20:33:01.1465415Z x0 = x0.contiguous() 2025-05-07T20:33:01.1465500Z x1 = x1.contiguous() 2025-05-07T20:33:01.1465569Z 2025-05-07T20:33:01.1465661Z if scale_ub is not None: 2025-05-07T20:33:01.1465765Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1465901Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1465982Z ) 2025-05-07T20:33:01.1466062Z else: 2025-05-07T20:33:01.1466156Z scale_ub_tensor = None 2025-05-07T20:33:01.1466232Z 2025-05-07T20:33:01.1466357Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1466442Z op = silu_mul_quant 2025-05-07T20:33:01.1466529Z if compiled: 2025-05-07T20:33:01.1466624Z op = torch.compile(op) 2025-05-07T20:33:01.1466732Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1466802Z 2025-05-07T20:33:01.1466886Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1466891Z 2025-05-07T20:33:01.1466989Z moe/activation_test.py:117: 2025-05-07T20:33:01.1467112Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1467210Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1467313Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1467965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.1468067Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1468418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1468632Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1468975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1469062Z kernel = self.compile( 2025-05-07T20:33:01.1469457Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1469631Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1469755Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1469760Z 2025-05-07T20:33:01.1469968Z self = 2025-05-07T20:33:01.1470797Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1471287Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3788e25e40>} 2025-05-07T20:33:01.1472024Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1472208Z context = 2025-05-07T20:33:01.1472255Z 2025-05-07T20:33:01.1472422Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1472680Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1472786Z module_map=module_map) 2025-05-07T20:33:01.1472950Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1473042Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1473120Z E ^ 2025-05-07T20:33:01.1473466Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1473471Z 2025-05-07T20:33:01.1473873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1473878Z 2025-05-07T20:33:01.1473983Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1474199Z self=, 2025-05-07T20:33:01.1474282Z T=128, 2025-05-07T20:33:01.1474355Z D=5120, 2025-05-07T20:33:01.1474438Z scale_ub=1200.0, 2025-05-07T20:33:01.1474527Z contiguous=True, 2025-05-07T20:33:01.1474605Z compiled=False, 2025-05-07T20:33:01.1474674Z ) 2025-05-07T20:33:01.1474896Z self = 2025-05-07T20:33:01.1475061Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:01.1475065Z 2025-05-07T20:33:01.1475139Z @given( 2025-05-07T20:33:01.1475262Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1475358Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1475473Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1475585Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1475693Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1475772Z ) 2025-05-07T20:33:01.1476010Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1476178Z def test_silu_mul_quant( 2025-05-07T20:33:01.1476257Z self, 2025-05-07T20:33:01.1476336Z T: int, 2025-05-07T20:33:01.1476410Z D: int, 2025-05-07T20:33:01.1476509Z scale_ub: Optional[float], 2025-05-07T20:33:01.1476596Z contiguous: bool, 2025-05-07T20:33:01.1476675Z compiled: bool, 2025-05-07T20:33:01.1476755Z ) -> None: 2025-05-07T20:33:01.1476847Z torch.manual_seed(2025) 2025-05-07T20:33:01.1476921Z 2025-05-07T20:33:01.1477085Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1477155Z 2025-05-07T20:33:01.1477250Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1477369Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1477454Z x = x_sign * x_clamp 2025-05-07T20:33:01.1477537Z x0 = x[:, :D] 2025-05-07T20:33:01.1477617Z x1 = x[:, D:] 2025-05-07T20:33:01.1477687Z 2025-05-07T20:33:01.1477774Z if contiguous: 2025-05-07T20:33:01.1477867Z x0 = x0.contiguous() 2025-05-07T20:33:01.1477996Z x1 = x1.contiguous() 2025-05-07T20:33:01.1478070Z 2025-05-07T20:33:01.1478156Z if scale_ub is not None: 2025-05-07T20:33:01.1478257Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1478394Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1478466Z ) 2025-05-07T20:33:01.1478544Z else: 2025-05-07T20:33:01.1478633Z scale_ub_tensor = None 2025-05-07T20:33:01.1478704Z 2025-05-07T20:33:01.1478835Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1478921Z op = silu_mul_quant 2025-05-07T20:33:01.1479001Z if compiled: 2025-05-07T20:33:01.1479103Z op = torch.compile(op) 2025-05-07T20:33:01.1479207Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1479320Z 2025-05-07T20:33:01.1479414Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1479418Z 2025-05-07T20:33:01.1479521Z moe/activation_test.py:117: 2025-05-07T20:33:01.1479656Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1479750Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1479847Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1480344Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.1480437Z 
_fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:01.1480793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:01.1481019Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:01.1481354Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:01.1481455Z     kernel = self.compile(
2025-05-07T20:33:01.1481855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:01.1482028Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:01.1482156Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:01.1482160Z 
2025-05-07T20:33:01.1482357Z self = <triton.compiler.compiler.ASTSource object at 0x…>
2025-05-07T20:33:01.1483121Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:01.1483613Z codegen_fns = {'convert_custom_types': <function …>, 'min_dot_size': <function … at 0x7f37ae84a0c0>}
2025-05-07T20:33:01.1484426Z module_map = {'triton.language.extra.libdevice': <module …>}
2025-05-07T20:33:01.1484622Z context = <…>
2025-05-07T20:33:01.1484627Z 
2025-05-07T20:33:01.1484785Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:01.1485044Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:01.1485148Z                            module_map=module_map)
2025-05-07T20:33:01.1485308Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:01.1485406Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:33:01.1485481Z E   ^
2025-05-07T20:33:01.1485825Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:01.1485839Z 
2025-05-07T20:33:01.1486247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:01.1486296Z 
2025-05-07T20:33:01.1486395Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:01.1486617Z     self=<…>,
2025-05-07T20:33:01.1486690Z     T=1,
2025-05-07T20:33:01.1486764Z     D=7168,
2025-05-07T20:33:01.1486850Z     scale_ub=1200.0,
2025-05-07T20:33:01.1486929Z     contiguous=True,
2025-05-07T20:33:01.1487008Z     compiled=True,
2025-05-07T20:33:01.1487084Z )
2025-05-07T20:33:01.1487296Z self = <…>
2025-05-07T20:33:01.1487463Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:33:01.1487467Z 
2025-05-07T20:33:01.1487541Z     @given(
2025-05-07T20:33:01.1487654Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:01.1487799Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:01.1487910Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:01.1488027Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:01.1488143Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:01.1488215Z     )
2025-05-07T20:33:01.1488462Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:01.1488550Z     def test_silu_mul_quant(
2025-05-07T20:33:01.1488623Z         self,
2025-05-07T20:33:01.1488702Z         T: int,
2025-05-07T20:33:01.1488775Z         D: int,
2025-05-07T20:33:01.1488869Z         scale_ub: Optional[float],
2025-05-07T20:33:01.1488958Z         contiguous: bool,
2025-05-07T20:33:01.1489039Z         compiled: bool,
2025-05-07T20:33:01.1489114Z     ) -> None:
2025-05-07T20:33:01.1489212Z         torch.manual_seed(2025)
2025-05-07T20:33:01.1489279Z 
2025-05-07T20:33:01.1489445Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:01.1489519Z 
2025-05-07T20:33:01.1489607Z         x_sign = torch.sign(x)
2025-05-07T20:33:01.1489732Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:01.1489825Z         x = x_sign * x_clamp
2025-05-07T20:33:01.1489900Z         x0 = x[:, :D]
2025-05-07T20:33:01.1489984Z         x1 = x[:, D:]
2025-05-07T20:33:01.1490055Z 
2025-05-07T20:33:01.1490134Z         if contiguous:
2025-05-07T20:33:01.1490226Z             x0 = x0.contiguous()
2025-05-07T20:33:01.1490310Z             x1 = x1.contiguous()
2025-05-07T20:33:01.1490379Z 
2025-05-07T20:33:01.1490470Z         if scale_ub is not None:
2025-05-07T20:33:01.1490572Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:01.1490703Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:01.1490785Z             )
2025-05-07T20:33:01.1490860Z         else:
2025-05-07T20:33:01.1490955Z             scale_ub_tensor = None
2025-05-07T20:33:01.1491033Z 
2025-05-07T20:33:01.1491159Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:01.1491253Z             op = silu_mul_quant
2025-05-07T20:33:01.1491418Z             if compiled:
2025-05-07T20:33:01.1491517Z                 op = torch.compile(op)
2025-05-07T20:33:01.1491627Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:01.1491697Z 
2025-05-07T20:33:01.1491783Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:01.1491787Z 
2025-05-07T20:33:01.1491886Z moe/activation_test.py:117: 
2025-05-07T20:33:01.1492010Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:01.1492105Z moe/activation_test.py:115: in fn
2025-05-07T20:33:01.1492210Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:01.1492576Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:33:01.1492675Z     return fn(*args, **kwargs)
2025-05-07T20:33:01.1493168Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:01.1493265Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:01.1493665Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:01.1493882Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:01.1494219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:01.1494321Z     kernel = self.compile(
2025-05-07T20:33:01.1494702Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:01.1494877Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:01.1495001Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:01.1495045Z 
2025-05-07T20:33:01.1495242Z self = <triton.compiler.compiler.ASTSource object at 0x…>
2025-05-07T20:33:01.1496015Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:01.1496509Z codegen_fns = {'convert_custom_types': <function …>, 'min_dot_size': <function … at 0x7f37ae35f1a0>}
2025-05-07T20:33:01.1497248Z module_map = {'triton.language.extra.libdevice': <module …>}
2025-05-07T20:33:01.1497432Z context = <…>
2025-05-07T20:33:01.1497437Z 
2025-05-07T20:33:01.1497604Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:01.1497862Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:01.1497972Z                            module_map=module_map)
2025-05-07T20:33:01.1498139Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:01.1498235Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:33:01.1498305Z E   ^
2025-05-07T20:33:01.1498657Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:01.1498663Z 
2025-05-07T20:33:01.1499073Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
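[Note on the failure mode: fp8e4nv is Triton's name for the CUDA e4m3 float8 type, which Triton only lowers on GPUs of compute capability 8.9 or newer (Ada/Hopper). This job runs on a g5.4xlarge, whose NVIDIA A10G reports sm_86, so every Triton kernel that touches fp8e4nv fails at compile time regardless of T, D, scale_ub, contiguous, or compiled. A minimal guard sketch, assuming only the standard torch API; the helper name supports_fp8e4nv is hypothetical:]

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton lowers fp8e4nv (e4m3) only on compute capability >= (8, 9),
        # i.e. Ada/Hopper; the A10G on this runner reports (8, 6).
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical use on the failing test, e.g. with unittest.skipIf:
    # @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires sm_89+")
    # def test_silu_mul_quant(self, ...) -> None: ...

[Guarding this way would turn the repeated CompilationError below into a clean skip on sm_86 runners while leaving sm_89+ coverage intact.]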
2025-05-07T20:33:01.1499185Z Trying example: test_silu_mul_quant(self=<…>, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> same CompilationError in _fbgemm_silu_mul_quant [identical traceback and test source elided]
2025-05-07T20:33:01.1512211Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:01.1512428Z     self=<…>,
2025-05-07T20:33:01.1512508Z     T=1,
2025-05-07T20:33:01.1512584Z     D=7168,
2025-05-07T20:33:01.1512665Z     scale_ub=None,
2025-05-07T20:33:01.1512751Z     contiguous=False,
2025-05-07T20:33:01.1512830Z     compiled=True,
2025-05-07T20:33:01.1512902Z )
2025-05-07T20:33:01.1513119Z self = <…>
2025-05-07T20:33:01.1513278Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
[test source as above; for this example fn() succeeds and the reference path fails instead:]
2025-05-07T20:33:01.1517633Z         y_fp8, y_scale = fn()
2025-05-07T20:33:01.1517749Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:33:01.1517826Z 
2025-05-07T20:33:01.1517960Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:01.1518058Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:33:01.1518158Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:33:01.1518276Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:33:01.1518413Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:01.1518537Z 
2025-05-07T20:33:01.1523336Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:01.1523347Z 
2025-05-07T20:33:01.1523474Z moe/activation_test.py:126: 
2025-05-07T20:33:01.1523607Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:01.1523712Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:33:01.1523852Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:01.1524406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:33:01.1524503Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:33:01.1524864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:01.1525080Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:01.1525470Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:33:01.1525726Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:01.1526100Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:33:01.1526268Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:33:01.1526609Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:33:01.1526685Z     fn()
2025-05-07T20:33:01.1527089Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:33:01.1527171Z     self.fn.run(
2025-05-07T20:33:01.1527511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:01.1527603Z     kernel = self.compile(
2025-05-07T20:33:01.1527984Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:01.1528264Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:01.1528394Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
[compiler locals as above, except this autotuner config uses num_stages=2]
2025-05-07T20:33:01.1531522Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:01.1531621Z E   def _kernel_quantize_fp8_row(
2025-05-07T20:33:01.1531706Z E   ^
2025-05-07T20:33:01.1532057Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:01.1532062Z 
2025-05-07T20:33:01.1532516Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
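[Note that this failure comes from the reference path, not the kernel under test: triton_quantize_fp8_row JIT-compiles its own Triton fp8e4nv kernel, so a capability guard has to cover both fn() and ref_fn(). For illustration only, rowwise fp8 quantization can be written in plain PyTorch with no Triton involved. This is a hypothetical stand-in; the exact scale_ub and epsilon handling in fbgemm's triton_quantize_fp8_row may differ:]

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

    def quantize_fp8_row_torch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row dynamic scaling so that y ~= y_fp8.float() * scale[:, None].
        row_max = y.abs().amax(dim=-1, keepdim=True).float()
        if scale_ub is not None:
            # Assumed semantics: cap the row max before deriving the scale.
            row_max = torch.minimum(row_max, scale_ub)
        scale = row_max.clamp(min=1e-12) / FP8_MAX
        y_fp8 = (y.float() / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
        return y_fp8, scale.squeeze(-1)

[Dequantizing with y_fp8.to(torch.float32) * y_scale[:, None], exactly as the test does, then recovers y up to e4m3 rounding.]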
[Every remaining example fails with the identical fp8e4nv CompilationError; duplicate test source and tracebacks elided, failing kernel noted per example:]
2025-05-07T20:33:01.1532633Z Trying example: test_silu_mul_quant(self=<…>, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError in _fbgemm_silu_mul_quant
2025-05-07T20:33:01.1546153Z Trying example: test_silu_mul_quant(self=<…>, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant
2025-05-07T20:33:01.1558789Z Trying example: test_silu_mul_quant(self=<…>, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError in _fbgemm_silu_mul_quant
2025-05-07T20:33:01.1571862Z Trying example: test_silu_mul_quant(self=<…>, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) -> CompilationError in _fbgemm_silu_mul_quant
2025-05-07T20:33:01.1584599Z Trying example: test_silu_mul_quant(self=<…>, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False) -> CompilationError in _fbgemm_silu_mul_quant
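[For reference, the op these examples exercise is SiLU-gated multiplication followed by the rowwise quantization sketched above. Unfused, in plain PyTorch, mirroring ref_fn from the listing (function name hypothetical):]

    import torch

    def silu_mul(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
        # SiLU(x0) * x1 computed in fp32, exactly as ref_fn does above.
        x0_fp32 = x0.to(torch.float32)
        return x0_fp32 * torch.sigmoid(x0_fp32) * x1.to(torch.float32)

    # y_fp8, y_scale = quantize_fp8_row_torch(silu_mul(x0, x1), scale_ub_tensor)

[The fused _fbgemm_silu_mul_quant kernel computes the same thing in one pass; presumably only its fp8 output type trips the sm_86 compile error.]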
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1596930Z 2025-05-07T20:33:01.1597368Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1597373Z 2025-05-07T20:33:01.1597477Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1597695Z self=, 2025-05-07T20:33:01.1597770Z T=4096, 2025-05-07T20:33:01.1597851Z D=7168, 2025-05-07T20:33:01.1597931Z scale_ub=1200.0, 2025-05-07T20:33:01.1598014Z contiguous=False, 2025-05-07T20:33:01.1598101Z compiled=False, 2025-05-07T20:33:01.1598170Z ) 2025-05-07T20:33:01.1598392Z self = 2025-05-07T20:33:01.1598566Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:01.1598574Z 2025-05-07T20:33:01.1598649Z @given( 2025-05-07T20:33:01.1598778Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1598881Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1598992Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1599111Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1599219Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1599292Z ) 2025-05-07T20:33:01.1599538Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1599630Z def test_silu_mul_quant( 2025-05-07T20:33:01.1599712Z self, 2025-05-07T20:33:01.1599787Z T: int, 2025-05-07T20:33:01.1599864Z D: int, 2025-05-07T20:33:01.1599963Z scale_ub: Optional[float], 2025-05-07T20:33:01.1600048Z contiguous: bool, 2025-05-07T20:33:01.1600132Z compiled: bool, 2025-05-07T20:33:01.1600221Z ) -> None: 2025-05-07T20:33:01.1600315Z torch.manual_seed(2025) 2025-05-07T20:33:01.1600388Z 2025-05-07T20:33:01.1600639Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1600720Z 2025-05-07T20:33:01.1600808Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1600980Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1601101Z x = x_sign * x_clamp 2025-05-07T20:33:01.1601216Z x0 = x[:, :D] 2025-05-07T20:33:01.1601327Z x1 = x[:, D:] 2025-05-07T20:33:01.1601413Z 2025-05-07T20:33:01.1601500Z if contiguous: 2025-05-07T20:33:01.1601586Z x0 = x0.contiguous() 2025-05-07T20:33:01.1601669Z x1 = x1.contiguous() 2025-05-07T20:33:01.1601749Z 2025-05-07T20:33:01.1601834Z if scale_ub is not None: 2025-05-07T20:33:01.1601935Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1602081Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1602160Z ) 2025-05-07T20:33:01.1602236Z else: 2025-05-07T20:33:01.1602360Z scale_ub_tensor = None 2025-05-07T20:33:01.1602466Z 2025-05-07T20:33:01.1602701Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1602824Z op = silu_mul_quant 2025-05-07T20:33:01.1602934Z if compiled: 2025-05-07T20:33:01.1603051Z op = torch.compile(op) 2025-05-07T20:33:01.1603157Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1603223Z 2025-05-07T20:33:01.1603317Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1603322Z 2025-05-07T20:33:01.1603415Z moe/activation_test.py:117: 2025-05-07T20:33:01.1603539Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1603640Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1603735Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1604279Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:01.1604380Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1604734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1604962Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1605300Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1605389Z kernel = self.compile( 2025-05-07T20:33:01.1605790Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1605963Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1606090Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1606098Z 2025-05-07T20:33:01.1606294Z self = 2025-05-07T20:33:01.1607064Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1607563Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3788f9b880>} 2025-05-07T20:33:01.1608298Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1608485Z context = 2025-05-07T20:33:01.1608489Z 2025-05-07T20:33:01.1608647Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1608904Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1609093Z module_map=module_map) 2025-05-07T20:33:01.1609255Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1609355Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1609431Z E ^ 2025-05-07T20:33:01.1609778Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1609782Z 2025-05-07T20:33:01.1610217Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1610222Z 2025-05-07T20:33:01.1610321Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1610541Z self=, 2025-05-07T20:33:01.1610616Z T=16384, 2025-05-07T20:33:01.1610692Z D=7168, 2025-05-07T20:33:01.1610777Z scale_ub=None, 2025-05-07T20:33:01.1610860Z contiguous=True, 2025-05-07T20:33:01.1610940Z compiled=True, 2025-05-07T20:33:01.1611017Z ) 2025-05-07T20:33:01.1611229Z self = 2025-05-07T20:33:01.1611443Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:01.1611447Z 2025-05-07T20:33:01.1611528Z @given( 2025-05-07T20:33:01.1611643Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1611743Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1611852Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1611964Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1612076Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1612143Z ) 2025-05-07T20:33:01.1612380Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1612519Z def test_silu_mul_quant( 2025-05-07T20:33:01.1612592Z self, 2025-05-07T20:33:01.1612667Z T: int, 2025-05-07T20:33:01.1612746Z D: int, 2025-05-07T20:33:01.1612843Z scale_ub: Optional[float], 2025-05-07T20:33:01.1612929Z contiguous: bool, 2025-05-07T20:33:01.1613015Z compiled: bool, 2025-05-07T20:33:01.1613088Z ) -> None: 2025-05-07T20:33:01.1613184Z torch.manual_seed(2025) 2025-05-07T20:33:01.1613256Z 2025-05-07T20:33:01.1613417Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1613491Z 2025-05-07T20:33:01.1613580Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1613699Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1613788Z x = x_sign * x_clamp 2025-05-07T20:33:01.1613862Z x0 = x[:, :D] 2025-05-07T20:33:01.1613937Z x1 = x[:, D:] 2025-05-07T20:33:01.1614012Z 2025-05-07T20:33:01.1614091Z if contiguous: 2025-05-07T20:33:01.1614181Z x0 = x0.contiguous() 2025-05-07T20:33:01.1614274Z x1 = x1.contiguous() 2025-05-07T20:33:01.1614341Z 2025-05-07T20:33:01.1614438Z if scale_ub is not None: 2025-05-07T20:33:01.1614546Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1614676Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1614757Z ) 2025-05-07T20:33:01.1614831Z else: 2025-05-07T20:33:01.1614921Z scale_ub_tensor = None 2025-05-07T20:33:01.1614999Z 2025-05-07T20:33:01.1615124Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1615210Z op = silu_mul_quant 2025-05-07T20:33:01.1615298Z if compiled: 2025-05-07T20:33:01.1615394Z op = torch.compile(op) 2025-05-07T20:33:01.1615495Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1615567Z 2025-05-07T20:33:01.1615651Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1615661Z 2025-05-07T20:33:01.1615760Z moe/activation_test.py:117: 2025-05-07T20:33:01.1615884Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1616064Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1616167Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1616531Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:01.1616620Z return fn(*args, **kwargs) 
2025-05-07T20:33:01.1617111Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.1617204Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1617562Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1617783Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1618122Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1618218Z kernel = self.compile( 2025-05-07T20:33:01.1618620Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1618830Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1618959Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1618963Z 2025-05-07T20:33:01.1619158Z self = 2025-05-07T20:33:01.1619925Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1620415Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f3788f9a980>} 2025-05-07T20:33:01.1621200Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1621386Z context = 2025-05-07T20:33:01.1621390Z 2025-05-07T20:33:01.1621547Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1621808Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1621912Z module_map=module_map) 2025-05-07T20:33:01.1622072Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1622167Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1622243Z E ^ 2025-05-07T20:33:01.1622597Z E ValueError("type fp8e4nv not supported in this architecture. 
2025-05-07T20:33:01.1623127Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:01.1623347Z     self=,
2025-05-07T20:33:01.1623422Z     T=4096,
2025-05-07T20:33:01.1623499Z     D=5120,
2025-05-07T20:33:01.1623580Z     scale_ub=None,
2025-05-07T20:33:01.1623662Z     contiguous=False,
2025-05-07T20:33:01.1623745Z     compiled=True,
2025-05-07T20:33:01.1623816Z )
2025-05-07T20:33:01.1624025Z self =
2025-05-07T20:33:01.1624199Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:33:01.1624204Z 
2025-05-07T20:33:01.1624276Z     @given(
2025-05-07T20:33:01.1624398Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:01.1624499Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:01.1624728Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:01.1624850Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:01.1624958Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:01.1625027Z     )
2025-05-07T20:33:01.1625266Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:01.1625355Z     def test_silu_mul_quant(
2025-05-07T20:33:01.1625427Z         self,
2025-05-07T20:33:01.1625506Z         T: int,
2025-05-07T20:33:01.1625580Z         D: int,
2025-05-07T20:33:01.1625672Z         scale_ub: Optional[float],
2025-05-07T20:33:01.1625763Z         contiguous: bool,
2025-05-07T20:33:01.1625844Z         compiled: bool,
2025-05-07T20:33:01.1625918Z     ) -> None:
2025-05-07T20:33:01.1626013Z         torch.manual_seed(2025)
2025-05-07T20:33:01.1626083Z 
2025-05-07T20:33:01.1626252Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:01.1626322Z 
2025-05-07T20:33:01.1626418Z         x_sign = torch.sign(x)
2025-05-07T20:33:01.1626590Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:01.1626676Z         x = x_sign * x_clamp
2025-05-07T20:33:01.1626752Z         x0 = x[:, :D]
2025-05-07T20:33:01.1626833Z         x1 = x[:, D:]
2025-05-07T20:33:01.1626907Z 
2025-05-07T20:33:01.1626987Z         if contiguous:
2025-05-07T20:33:01.1627079Z             x0 = x0.contiguous()
2025-05-07T20:33:01.1627165Z             x1 = x1.contiguous()
2025-05-07T20:33:01.1627236Z 
2025-05-07T20:33:01.1627327Z         if scale_ub is not None:
2025-05-07T20:33:01.1627556Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:01.1627708Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:01.1627782Z             )
2025-05-07T20:33:01.1627856Z         else:
2025-05-07T20:33:01.1628011Z             scale_ub_tensor = None
2025-05-07T20:33:01.1628082Z 
2025-05-07T20:33:01.1628211Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:01.1628313Z             op = silu_mul_quant
2025-05-07T20:33:01.1628400Z             if compiled:
2025-05-07T20:33:01.1628493Z                 op = torch.compile(op)
2025-05-07T20:33:01.1628602Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:01.1628670Z 
2025-05-07T20:33:01.1628759Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:01.1628763Z 
2025-05-07T20:33:01.1628861Z moe/activation_test.py:117:
2025-05-07T20:33:01.1628985Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:01.1629094Z moe/activation_test.py:115: in fn
2025-05-07T20:33:01.1629188Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:01.1629550Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:33:01.1629650Z     return fn(*args, **kwargs)
2025-05-07T20:33:01.1630141Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:01.1630235Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:01.1630598Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:01.1630816Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:01.1631159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:01.1631248Z     kernel = self.compile(
2025-05-07T20:33:01.1631648Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:01.1631823Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:01.1631950Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:01.1631957Z 
2025-05-07T20:33:01.1632161Z self =
2025-05-07T20:33:01.1633018Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:01.1633516Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f37885a3920>}
2025-05-07T20:33:01.1634257Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:01.1634444Z context =
2025-05-07T20:33:01.1634451Z 
2025-05-07T20:33:01.1634617Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:01.1634879Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:01.1635021Z                            module_map=module_map)
2025-05-07T20:33:01.1635185Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:01.1635279Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:33:01.1635361Z E   ^
2025-05-07T20:33:01.1635710Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:01.1636119Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:01.1636123Z 
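The root cause is architectural: Triton's fp8e4nv type (PyTorch's float8_e4m3fn) is implemented by the NVIDIA backend only for compute capability 8.9 and newer (Ada/Hopper); older parts such as SM 8.6 Ampere GPUs (e.g. an A10G) expose only fp8e4b15 and fp8e5, which is exactly what the ValueError lists. A quick probe, as a sketch assuming a CUDA build of PyTorch:

    # Capability probe sketch; assumes torch with a CUDA device available.
    import torch

    major, minor = torch.cuda.get_device_capability()
    print(f"compute capability: sm_{major}{minor}")

    # fp8e4nv (torch.float8_e4m3fn) needs SM 8.9+; below that Triton
    # only offers fp8e4b15 / fp8e5, so this kernel cannot compile.
    if (major, minor) < (8, 9):
        print("fp8e4nv kernels will fail to compile on this device")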
Each of the eleven further examples below failed identically: the same test body, the same call path (moe/activation_test.py:117 -> silu_mul_quant -> triton jit.py:330/623 -> compiler.py:273 -> compiler.py:100), and the same error, CompilationError wrapping ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"); only the Hypothesis example parameters differ:
2025-05-07T20:33:01.1636231Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:01.1654277Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:01.1667927Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:01.1680314Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:01.1692839Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:01.1705739Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:33:01.1718330Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:01.1731195Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:01.1744797Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:01.1757247Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:01.1775001Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=False, compiled=True)
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1787630Z 2025-05-07T20:33:01.1788048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1788052Z 2025-05-07T20:33:01.1788151Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1788371Z self=, 2025-05-07T20:33:01.1788454Z T=2048, 2025-05-07T20:33:01.1788529Z D=5120, 2025-05-07T20:33:01.1788615Z scale_ub=None, 2025-05-07T20:33:01.1788697Z contiguous=False, 2025-05-07T20:33:01.1788775Z compiled=True, 2025-05-07T20:33:01.1788850Z ) 2025-05-07T20:33:01.1789069Z self = 2025-05-07T20:33:01.1789290Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:01.1789295Z 2025-05-07T20:33:01.1789375Z @given( 2025-05-07T20:33:01.1789490Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1789587Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1789704Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1789818Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1789935Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1790008Z ) 2025-05-07T20:33:01.1790244Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1790340Z def test_silu_mul_quant( 2025-05-07T20:33:01.1790458Z self, 2025-05-07T20:33:01.1790533Z T: int, 2025-05-07T20:33:01.1790614Z D: int, 2025-05-07T20:33:01.1790711Z scale_ub: Optional[float], 2025-05-07T20:33:01.1790799Z contiguous: bool, 2025-05-07T20:33:01.1790887Z compiled: bool, 2025-05-07T20:33:01.1790963Z ) -> None: 2025-05-07T20:33:01.1791055Z torch.manual_seed(2025) 2025-05-07T20:33:01.1791129Z 2025-05-07T20:33:01.1791290Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1791370Z 2025-05-07T20:33:01.1791458Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1791576Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1791667Z x = x_sign * x_clamp 2025-05-07T20:33:01.1791743Z x0 = x[:, :D] 2025-05-07T20:33:01.1791819Z x1 = x[:, D:] 2025-05-07T20:33:01.1791896Z 2025-05-07T20:33:01.1791976Z if contiguous: 2025-05-07T20:33:01.1792064Z x0 = x0.contiguous() 2025-05-07T20:33:01.1792162Z x1 = x1.contiguous() 2025-05-07T20:33:01.1792229Z 2025-05-07T20:33:01.1792316Z if scale_ub is not None: 2025-05-07T20:33:01.1792432Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1792564Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1792636Z ) 2025-05-07T20:33:01.1792712Z else: 2025-05-07T20:33:01.1792799Z scale_ub_tensor = None 2025-05-07T20:33:01.1792873Z 2025-05-07T20:33:01.1792998Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1793083Z op = silu_mul_quant 2025-05-07T20:33:01.1793169Z if compiled: 2025-05-07T20:33:01.1793263Z op = torch.compile(op) 2025-05-07T20:33:01.1793364Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1793442Z 2025-05-07T20:33:01.1793529Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1793534Z 2025-05-07T20:33:01.1793627Z moe/activation_test.py:117: 2025-05-07T20:33:01.1793757Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1793938Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1794038Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1794404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:01.1794493Z return fn(*args, **kwargs) 
2025-05-07T20:33:01.1794987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.1795080Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1795432Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1795656Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1795990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1796087Z kernel = self.compile( 2025-05-07T20:33:01.1796489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1796699Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1796829Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1796833Z 2025-05-07T20:33:01.1797030Z self = 2025-05-07T20:33:01.1797797Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1798287Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d78f07c0>} 2025-05-07T20:33:01.1799066Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1799261Z context = 2025-05-07T20:33:01.1799265Z 2025-05-07T20:33:01.1799422Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1799681Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1799784Z module_map=module_map) 2025-05-07T20:33:01.1799940Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1800038Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1800112Z E ^ 2025-05-07T20:33:01.1800458Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1800471Z 2025-05-07T20:33:01.1800886Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1800893Z 2025-05-07T20:33:01.1800994Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1801214Z self=, 2025-05-07T20:33:01.1801292Z T=2048, 2025-05-07T20:33:01.1801363Z D=5120, 2025-05-07T20:33:01.1801448Z scale_ub=1200.0, 2025-05-07T20:33:01.1801530Z contiguous=False, 2025-05-07T20:33:01.1801608Z compiled=True, 2025-05-07T20:33:01.1801681Z ) 2025-05-07T20:33:01.1801892Z self = 2025-05-07T20:33:01.1802068Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:01.1802073Z 2025-05-07T20:33:01.1802147Z @given( 2025-05-07T20:33:01.1802262Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1802367Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1802562Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1802675Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1802793Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1802865Z ) 2025-05-07T20:33:01.1803109Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1803196Z def test_silu_mul_quant( 2025-05-07T20:33:01.1803270Z self, 2025-05-07T20:33:01.1803350Z T: int, 2025-05-07T20:33:01.1803423Z D: int, 2025-05-07T20:33:01.1803515Z scale_ub: Optional[float], 2025-05-07T20:33:01.1803610Z contiguous: bool, 2025-05-07T20:33:01.1803690Z compiled: bool, 2025-05-07T20:33:01.1803765Z ) -> None: 2025-05-07T20:33:01.1803862Z torch.manual_seed(2025) 2025-05-07T20:33:01.1803931Z 2025-05-07T20:33:01.1804098Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1804177Z 2025-05-07T20:33:01.1804265Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1804392Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1804526Z x = x_sign * x_clamp 2025-05-07T20:33:01.1804603Z x0 = x[:, :D] 2025-05-07T20:33:01.1804684Z x1 = x[:, D:] 2025-05-07T20:33:01.1804754Z 2025-05-07T20:33:01.1804832Z if contiguous: 2025-05-07T20:33:01.1804925Z x0 = x0.contiguous() 2025-05-07T20:33:01.1805010Z x1 = x1.contiguous() 2025-05-07T20:33:01.1805079Z 2025-05-07T20:33:01.1805178Z if scale_ub is not None: 2025-05-07T20:33:01.1805280Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1805410Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1805492Z ) 2025-05-07T20:33:01.1805566Z else: 2025-05-07T20:33:01.1805656Z scale_ub_tensor = None 2025-05-07T20:33:01.1805775Z 2025-05-07T20:33:01.1805901Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1805998Z op = silu_mul_quant 2025-05-07T20:33:01.1806080Z if compiled: 2025-05-07T20:33:01.1806179Z op = torch.compile(op) 2025-05-07T20:33:01.1806288Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1806356Z 2025-05-07T20:33:01.1806444Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1806448Z 2025-05-07T20:33:01.1806549Z moe/activation_test.py:117: 2025-05-07T20:33:01.1806673Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1806767Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1806866Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1807231Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:01.1807326Z return fn(*args, **kwargs) 
2025-05-07T20:33:01.1807814Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.1807911Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1808273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1808492Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1808824Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1808920Z kernel = self.compile( 2025-05-07T20:33:01.1809316Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1809491Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1809612Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1809619Z 2025-05-07T20:33:01.1809814Z self = 2025-05-07T20:33:01.1810667Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1811162Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d78f1580>} 2025-05-07T20:33:01.1811902Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1812086Z context = 2025-05-07T20:33:01.1812090Z 2025-05-07T20:33:01.1812256Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1812510Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1812618Z module_map=module_map) 2025-05-07T20:33:01.1812825Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1812919Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1812993Z E ^ 2025-05-07T20:33:01.1813346Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1813351Z 2025-05-07T20:33:01.1813783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1813788Z 2025-05-07T20:33:01.1813890Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1814105Z self=, 2025-05-07T20:33:01.1814257Z T=4096, 2025-05-07T20:33:01.1814339Z D=5120, 2025-05-07T20:33:01.1814418Z scale_ub=1200.0, 2025-05-07T20:33:01.1814498Z contiguous=True, 2025-05-07T20:33:01.1814584Z compiled=True, 2025-05-07T20:33:01.1814651Z ) 2025-05-07T20:33:01.1814866Z self = 2025-05-07T20:33:01.1815037Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:01.1815042Z 2025-05-07T20:33:01.1815115Z @given( 2025-05-07T20:33:01.1815235Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1815329Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1815441Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1815557Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1815667Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1815739Z ) 2025-05-07T20:33:01.1815984Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1816076Z def test_silu_mul_quant( 2025-05-07T20:33:01.1816155Z self, 2025-05-07T20:33:01.1816229Z T: int, 2025-05-07T20:33:01.1816319Z D: int, 2025-05-07T20:33:01.1816422Z scale_ub: Optional[float], 2025-05-07T20:33:01.1816506Z contiguous: bool, 2025-05-07T20:33:01.1816586Z compiled: bool, 2025-05-07T20:33:01.1816668Z ) -> None: 2025-05-07T20:33:01.1816759Z torch.manual_seed(2025) 2025-05-07T20:33:01.1816829Z 2025-05-07T20:33:01.1816996Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1817068Z 2025-05-07T20:33:01.1817157Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1817281Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1817367Z x = x_sign * x_clamp 2025-05-07T20:33:01.1817447Z x0 = x[:, :D] 2025-05-07T20:33:01.1817528Z x1 = x[:, D:] 2025-05-07T20:33:01.1817596Z 2025-05-07T20:33:01.1817689Z if contiguous: 2025-05-07T20:33:01.1817779Z x0 = x0.contiguous() 2025-05-07T20:33:01.1817863Z x1 = x1.contiguous() 2025-05-07T20:33:01.1818026Z 2025-05-07T20:33:01.1818121Z if scale_ub is not None: 2025-05-07T20:33:01.1818234Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1818384Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1818460Z ) 2025-05-07T20:33:01.1818539Z else: 2025-05-07T20:33:01.1818641Z scale_ub_tensor = None 2025-05-07T20:33:01.1818714Z 2025-05-07T20:33:01.1818850Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1818950Z op = silu_mul_quant 2025-05-07T20:33:01.1819036Z if compiled: 2025-05-07T20:33:01.1819143Z op = torch.compile(op) 2025-05-07T20:33:01.1819254Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1819327Z 2025-05-07T20:33:01.1819426Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1819433Z 2025-05-07T20:33:01.1819533Z moe/activation_test.py:117: 2025-05-07T20:33:01.1819677Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1819912Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1820006Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1820370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:01.1820466Z return fn(*args, **kwargs) 
2025-05-07T20:33:01.1820951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.1821049Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1821401Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1821618Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1822000Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1822096Z kernel = self.compile( 2025-05-07T20:33:01.1822501Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1822671Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1822793Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1822797Z 2025-05-07T20:33:01.1822999Z self = 2025-05-07T20:33:01.1823757Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1824254Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d78f2840>} 2025-05-07T20:33:01.1824992Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1825178Z context = 2025-05-07T20:33:01.1825182Z 2025-05-07T20:33:01.1825345Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1825600Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1825708Z module_map=module_map) 2025-05-07T20:33:01.1825864Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1825957Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1826037Z E ^ 2025-05-07T20:33:01.1826385Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1826390Z 2025-05-07T20:33:01.1826950Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1826957Z 2025-05-07T20:33:01.1827055Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1827271Z self=, 2025-05-07T20:33:01.1827347Z T=128, 2025-05-07T20:33:01.1827540Z D=5120, 2025-05-07T20:33:01.1827649Z scale_ub=1200.0, 2025-05-07T20:33:01.1827767Z contiguous=False, 2025-05-07T20:33:01.1827879Z compiled=True, 2025-05-07T20:33:01.1827960Z ) 2025-05-07T20:33:01.1828180Z self = 2025-05-07T20:33:01.1828346Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:01.1828350Z 2025-05-07T20:33:01.1828428Z @given( 2025-05-07T20:33:01.1828542Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1828639Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1828760Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1828926Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1829033Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1829109Z ) 2025-05-07T20:33:01.1829355Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1829442Z def test_silu_mul_quant( 2025-05-07T20:33:01.1829520Z self, 2025-05-07T20:33:01.1829592Z T: int, 2025-05-07T20:33:01.1829670Z D: int, 2025-05-07T20:33:01.1829762Z scale_ub: Optional[float], 2025-05-07T20:33:01.1829848Z contiguous: bool, 2025-05-07T20:33:01.1829937Z compiled: bool, 2025-05-07T20:33:01.1830011Z ) -> None: 2025-05-07T20:33:01.1830101Z torch.manual_seed(2025) 2025-05-07T20:33:01.1830226Z 2025-05-07T20:33:01.1830388Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1830459Z 2025-05-07T20:33:01.1830556Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1830678Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1830764Z x = x_sign * x_clamp 2025-05-07T20:33:01.1830850Z x0 = x[:, :D] 2025-05-07T20:33:01.1830926Z x1 = x[:, D:] 2025-05-07T20:33:01.1830993Z 2025-05-07T20:33:01.1831077Z if contiguous: 2025-05-07T20:33:01.1831168Z x0 = x0.contiguous() 2025-05-07T20:33:01.1831259Z x1 = x1.contiguous() 2025-05-07T20:33:01.1831335Z 2025-05-07T20:33:01.1831421Z if scale_ub is not None: 2025-05-07T20:33:01.1831535Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1831663Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1831736Z ) 2025-05-07T20:33:01.1831818Z else: 2025-05-07T20:33:01.1831907Z scale_ub_tensor = None 2025-05-07T20:33:01.1831977Z 2025-05-07T20:33:01.1832114Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1832200Z op = silu_mul_quant 2025-05-07T20:33:01.1832282Z if compiled: 2025-05-07T20:33:01.1832385Z op = torch.compile(op) 2025-05-07T20:33:01.1832489Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1832564Z 2025-05-07T20:33:01.1832653Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1832657Z 2025-05-07T20:33:01.1832750Z moe/activation_test.py:117: 2025-05-07T20:33:01.1832881Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1832978Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1833074Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1833448Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:01.1833539Z return fn(*args, **kwargs) 
2025-05-07T20:33:01.1834120Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.1834218Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1834572Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1834798Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1835137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1835231Z kernel = self.compile( 2025-05-07T20:33:01.1835619Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1835790Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1835924Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1835929Z 2025-05-07T20:33:01.1836131Z self = 2025-05-07T20:33:01.1836895Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1837449Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d78f34c0>} 2025-05-07T20:33:01.1838184Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1838384Z context = 2025-05-07T20:33:01.1838426Z 2025-05-07T20:33:01.1838588Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1838860Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1838965Z module_map=module_map) 2025-05-07T20:33:01.1839122Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1839220Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1839295Z E ^ 2025-05-07T20:33:01.1839644Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1839649Z 2025-05-07T20:33:01.1840540Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1840550Z 2025-05-07T20:33:01.1840697Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1840925Z self=, 2025-05-07T20:33:01.1841005Z T=16384, 2025-05-07T20:33:01.1841081Z D=7168, 2025-05-07T20:33:01.1841169Z scale_ub=1200.0, 2025-05-07T20:33:01.1841256Z contiguous=True, 2025-05-07T20:33:01.1841337Z compiled=True, 2025-05-07T20:33:01.1841413Z ) 2025-05-07T20:33:01.1841627Z self = 2025-05-07T20:33:01.1841798Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:01.1841803Z 2025-05-07T20:33:01.1841882Z @given( 2025-05-07T20:33:01.1841996Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1842096Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1842214Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1842326Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1842441Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1842515Z ) 2025-05-07T20:33:01.1842756Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1842850Z def test_silu_mul_quant( 2025-05-07T20:33:01.1843167Z self, 2025-05-07T20:33:01.1843245Z T: int, 2025-05-07T20:33:01.1843329Z D: int, 2025-05-07T20:33:01.1843424Z scale_ub: Optional[float], 2025-05-07T20:33:01.1843513Z contiguous: bool, 2025-05-07T20:33:01.1843594Z compiled: bool, 2025-05-07T20:33:01.1843669Z ) -> None: 2025-05-07T20:33:01.1843767Z torch.manual_seed(2025) 2025-05-07T20:33:01.1843837Z 2025-05-07T20:33:01.1844000Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1844077Z 2025-05-07T20:33:01.1844165Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1844285Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1844376Z x = x_sign * x_clamp 2025-05-07T20:33:01.1844455Z x0 = x[:, :D] 2025-05-07T20:33:01.1844535Z x1 = x[:, D:] 2025-05-07T20:33:01.1844612Z 2025-05-07T20:33:01.1844693Z if contiguous: 2025-05-07T20:33:01.1844780Z x0 = x0.contiguous() 2025-05-07T20:33:01.1844879Z x1 = x1.contiguous() 2025-05-07T20:33:01.1845051Z 2025-05-07T20:33:01.1845141Z if scale_ub is not None: 2025-05-07T20:33:01.1845245Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1845375Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1845454Z ) 2025-05-07T20:33:01.1845528Z else: 2025-05-07T20:33:01.1845620Z scale_ub_tensor = None 2025-05-07T20:33:01.1845696Z 2025-05-07T20:33:01.1845822Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1845907Z op = silu_mul_quant 2025-05-07T20:33:01.1845994Z if compiled: 2025-05-07T20:33:01.1846091Z op = torch.compile(op) 2025-05-07T20:33:01.1846191Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1846333Z 2025-05-07T20:33:01.1846422Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1846426Z 2025-05-07T20:33:01.1846528Z moe/activation_test.py:117: 2025-05-07T20:33:01.1846661Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1846768Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1846870Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1847236Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:01.1847326Z return fn(*args, **kwargs) 
2025-05-07T20:33:01.1847822Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.1847913Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1848277Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1848500Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1848842Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1848942Z kernel = self.compile( 2025-05-07T20:33:01.1849342Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1849513Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1849644Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1849648Z 2025-05-07T20:33:01.1849847Z self = 2025-05-07T20:33:01.1850618Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1851200Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d76fcc20>} 2025-05-07T20:33:01.1851944Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1852131Z context = 2025-05-07T20:33:01.1852136Z 2025-05-07T20:33:01.1852294Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1852560Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1852665Z module_map=module_map) 2025-05-07T20:33:01.1852828Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1852921Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1853004Z E ^ 2025-05-07T20:33:01.1853362Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1853367Z 2025-05-07T20:33:01.1853825Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1853829Z 2025-05-07T20:33:01.1853928Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1854151Z self=, 2025-05-07T20:33:01.1854225Z T=16384, 2025-05-07T20:33:01.1854306Z D=5120, 2025-05-07T20:33:01.1854391Z scale_ub=1200.0, 2025-05-07T20:33:01.1854471Z contiguous=True, 2025-05-07T20:33:01.1854564Z compiled=False, 2025-05-07T20:33:01.1854634Z ) 2025-05-07T20:33:01.1854847Z self = 2025-05-07T20:33:01.1855026Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:01.1855074Z 2025-05-07T20:33:01.1855148Z @given( 2025-05-07T20:33:01.1855269Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1855369Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1855482Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1855604Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1855712Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1855784Z ) 2025-05-07T20:33:01.1856028Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1856117Z def test_silu_mul_quant( 2025-05-07T20:33:01.1856191Z self, 2025-05-07T20:33:01.1856272Z T: int, 2025-05-07T20:33:01.1856349Z D: int, 2025-05-07T20:33:01.1856444Z scale_ub: Optional[float], 2025-05-07T20:33:01.1856535Z contiguous: bool, 2025-05-07T20:33:01.1856627Z compiled: bool, 2025-05-07T20:33:01.1856707Z ) -> None: 2025-05-07T20:33:01.1856804Z torch.manual_seed(2025) 2025-05-07T20:33:01.1856874Z 2025-05-07T20:33:01.1857048Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1857126Z 2025-05-07T20:33:01.1857245Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1857417Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1857530Z x = x_sign * x_clamp 2025-05-07T20:33:01.1857639Z x0 = x[:, :D] 2025-05-07T20:33:01.1857753Z x1 = x[:, D:] 2025-05-07T20:33:01.1857850Z 2025-05-07T20:33:01.1857963Z if contiguous: 2025-05-07T20:33:01.1858098Z x0 = x0.contiguous() 2025-05-07T20:33:01.1858215Z x1 = x1.contiguous() 2025-05-07T20:33:01.1858310Z 2025-05-07T20:33:01.1858433Z if scale_ub is not None: 2025-05-07T20:33:01.1858567Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1858705Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1858782Z ) 2025-05-07T20:33:01.1858857Z else: 2025-05-07T20:33:01.1858950Z scale_ub_tensor = None 2025-05-07T20:33:01.1859120Z 2025-05-07T20:33:01.1859257Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1859350Z op = silu_mul_quant 2025-05-07T20:33:01.1859430Z if compiled: 2025-05-07T20:33:01.1859526Z op = torch.compile(op) 2025-05-07T20:33:01.1859634Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1859704Z 2025-05-07T20:33:01.1859790Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1859800Z 2025-05-07T20:33:01.1859893Z moe/activation_test.py:117: 2025-05-07T20:33:01.1860017Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1860119Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1860214Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1860709Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:01.1860811Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1861174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1861435Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1861779Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1861871Z kernel = self.compile( 2025-05-07T20:33:01.1862275Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1862444Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1862569Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1862574Z 2025-05-07T20:33:01.1862822Z self = 2025-05-07T20:33:01.1863593Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1864098Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d76fd580>} 2025-05-07T20:33:01.1864841Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1865033Z context = 2025-05-07T20:33:01.1865039Z 2025-05-07T20:33:01.1865205Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1865464Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1865582Z module_map=module_map) 2025-05-07T20:33:01.1865745Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1865841Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1865922Z E ^ 2025-05-07T20:33:01.1866268Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1866272Z 2025-05-07T20:33:01.1866708Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1866712Z 2025-05-07T20:33:01.1866809Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1867028Z self=, 2025-05-07T20:33:01.1867111Z T=1, 2025-05-07T20:33:01.1867185Z D=7168, 2025-05-07T20:33:01.1867270Z scale_ub=1200.0, 2025-05-07T20:33:01.1867358Z contiguous=False, 2025-05-07T20:33:01.1867552Z compiled=False, 2025-05-07T20:33:01.1867634Z ) 2025-05-07T20:33:01.1867941Z self = 2025-05-07T20:33:01.1868147Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:01.1868154Z 2025-05-07T20:33:01.1868268Z @given( 2025-05-07T20:33:01.1868424Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1868551Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1868713Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1868868Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1869011Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1869111Z ) 2025-05-07T20:33:01.1869430Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1869570Z def test_silu_mul_quant( 2025-05-07T20:33:01.1869677Z self, 2025-05-07T20:33:01.1869780Z T: int, 2025-05-07T20:33:01.1869860Z D: int, 2025-05-07T20:33:01.1869960Z scale_ub: Optional[float], 2025-05-07T20:33:01.1872793Z contiguous: bool, 2025-05-07T20:33:01.1872901Z compiled: bool, 2025-05-07T20:33:01.1872989Z ) -> None: 2025-05-07T20:33:01.1873083Z torch.manual_seed(2025) 2025-05-07T20:33:01.1873161Z 2025-05-07T20:33:01.1873328Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1873401Z 2025-05-07T20:33:01.1873497Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1873624Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1873714Z x = x_sign * x_clamp 2025-05-07T20:33:01.1873797Z x0 = x[:, :D] 2025-05-07T20:33:01.1873873Z x1 = x[:, D:] 2025-05-07T20:33:01.1873945Z 2025-05-07T20:33:01.1874028Z if contiguous: 2025-05-07T20:33:01.1874181Z x0 = x0.contiguous() 2025-05-07T20:33:01.1874267Z x1 = x1.contiguous() 2025-05-07T20:33:01.1874343Z 2025-05-07T20:33:01.1874432Z if scale_ub is not None: 2025-05-07T20:33:01.1874536Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1874699Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1874774Z ) 2025-05-07T20:33:01.1874848Z else: 2025-05-07T20:33:01.1874944Z scale_ub_tensor = None 2025-05-07T20:33:01.1875016Z 2025-05-07T20:33:01.1875146Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1875234Z op = silu_mul_quant 2025-05-07T20:33:01.1875317Z if compiled: 2025-05-07T20:33:01.1875420Z op = torch.compile(op) 2025-05-07T20:33:01.1875521Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1875592Z 2025-05-07T20:33:01.1875685Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1875690Z 2025-05-07T20:33:01.1875788Z moe/activation_test.py:117: 2025-05-07T20:33:01.1875914Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1876019Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1876121Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1876621Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.1876715Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1877063Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1877286Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1877622Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1877714Z kernel = self.compile( 2025-05-07T20:33:01.1878101Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1878278Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1878493Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1878501Z 2025-05-07T20:33:01.1878699Z self = 2025-05-07T20:33:01.1879461Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1879959Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d76fe8e0>} 2025-05-07T20:33:01.1880690Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1880889Z context = 2025-05-07T20:33:01.1880935Z 2025-05-07T20:33:01.1881186Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1881448Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1881554Z module_map=module_map) 2025-05-07T20:33:01.1881713Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1881813Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1881889Z E ^ 2025-05-07T20:33:01.1882236Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1882241Z 2025-05-07T20:33:01.1882657Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1882702Z 2025-05-07T20:33:01.1882803Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1883032Z self=, 2025-05-07T20:33:01.1883115Z T=4096, 2025-05-07T20:33:01.1883194Z D=7168, 2025-05-07T20:33:01.1883286Z scale_ub=1200.0, 2025-05-07T20:33:01.1883370Z contiguous=False, 2025-05-07T20:33:01.1883450Z compiled=True, 2025-05-07T20:33:01.1883529Z ) 2025-05-07T20:33:01.1883743Z self = 2025-05-07T20:33:01.1883914Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:01.1883926Z 2025-05-07T20:33:01.1884002Z @given( 2025-05-07T20:33:01.1884116Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1884219Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1884329Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1884446Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1884567Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1884643Z ) 2025-05-07T20:33:01.1884886Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1884986Z def test_silu_mul_quant( 2025-05-07T20:33:01.1885062Z self, 2025-05-07T20:33:01.1885137Z T: int, 2025-05-07T20:33:01.1885217Z D: int, 2025-05-07T20:33:01.1885313Z scale_ub: Optional[float], 2025-05-07T20:33:01.1885408Z contiguous: bool, 2025-05-07T20:33:01.1885490Z compiled: bool, 2025-05-07T20:33:01.1885566Z ) -> None: 2025-05-07T20:33:01.1885662Z torch.manual_seed(2025) 2025-05-07T20:33:01.1885730Z 2025-05-07T20:33:01.1885895Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1885973Z 2025-05-07T20:33:01.1886064Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1886186Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1886275Z x = x_sign * x_clamp 2025-05-07T20:33:01.1886352Z x0 = x[:, :D] 2025-05-07T20:33:01.1886478Z x1 = x[:, D:] 2025-05-07T20:33:01.1886561Z 2025-05-07T20:33:01.1886646Z if contiguous: 2025-05-07T20:33:01.1886733Z x0 = x0.contiguous() 2025-05-07T20:33:01.1886823Z x1 = x1.contiguous() 2025-05-07T20:33:01.1886893Z 2025-05-07T20:33:01.1886986Z if scale_ub is not None: 2025-05-07T20:33:01.1887088Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1887221Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1887303Z ) 2025-05-07T20:33:01.1887374Z else: 2025-05-07T20:33:01.1887464Z scale_ub_tensor = None 2025-05-07T20:33:01.1887540Z 2025-05-07T20:33:01.1887666Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1887752Z op = silu_mul_quant 2025-05-07T20:33:01.1887845Z if compiled: 2025-05-07T20:33:01.1887942Z op = torch.compile(op) 2025-05-07T20:33:01.1888048Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1888169Z 2025-05-07T20:33:01.1888316Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1888321Z 2025-05-07T20:33:01.1888422Z moe/activation_test.py:117: 2025-05-07T20:33:01.1888547Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1888646Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1888750Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1889115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:01.1889206Z return fn(*args, **kwargs) 
2025-05-07T20:33:01.1889698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.1889833Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1890192Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1890414Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1890753Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1890847Z kernel = self.compile( 2025-05-07T20:33:01.1891224Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1891403Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1891525Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1891530Z 2025-05-07T20:33:01.1891727Z self = 2025-05-07T20:33:01.1892495Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1892993Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d76ffa60>} 2025-05-07T20:33:01.1893729Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1893913Z context = 2025-05-07T20:33:01.1893917Z 2025-05-07T20:33:01.1894076Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1894336Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1894442Z module_map=module_map) 2025-05-07T20:33:01.1894605Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1894747Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1894828Z E ^ 2025-05-07T20:33:01.1895182Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1895187Z 2025-05-07T20:33:01.1895597Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1895601Z 2025-05-07T20:33:01.1895704Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1895920Z self=, 2025-05-07T20:33:01.1895997Z T=128, 2025-05-07T20:33:01.1900972Z D=7168, 2025-05-07T20:33:01.1901071Z scale_ub=1200.0, 2025-05-07T20:33:01.1901154Z contiguous=False, 2025-05-07T20:33:01.1901253Z compiled=True, 2025-05-07T20:33:01.1901325Z ) 2025-05-07T20:33:01.1901544Z self = 2025-05-07T20:33:01.1901731Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:01.1901869Z 2025-05-07T20:33:01.1901948Z @given( 2025-05-07T20:33:01.1902066Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1902171Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1902281Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1902398Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1902506Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1902574Z ) 2025-05-07T20:33:01.1902820Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1902912Z def test_silu_mul_quant( 2025-05-07T20:33:01.1902987Z self, 2025-05-07T20:33:01.1903067Z T: int, 2025-05-07T20:33:01.1903190Z D: int, 2025-05-07T20:33:01.1903286Z scale_ub: Optional[float], 2025-05-07T20:33:01.1903380Z contiguous: bool, 2025-05-07T20:33:01.1903466Z compiled: bool, 2025-05-07T20:33:01.1903552Z ) -> None: 2025-05-07T20:33:01.1903652Z torch.manual_seed(2025) 2025-05-07T20:33:01.1903726Z 2025-05-07T20:33:01.1903899Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1903971Z 2025-05-07T20:33:01.1904064Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1904194Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1904284Z x = x_sign * x_clamp 2025-05-07T20:33:01.1904365Z x0 = x[:, :D] 2025-05-07T20:33:01.1904450Z x1 = x[:, D:] 2025-05-07T20:33:01.1904522Z 2025-05-07T20:33:01.1904605Z if contiguous: 2025-05-07T20:33:01.1904698Z x0 = x0.contiguous() 2025-05-07T20:33:01.1904785Z x1 = x1.contiguous() 2025-05-07T20:33:01.1904861Z 2025-05-07T20:33:01.1904955Z if scale_ub is not None: 2025-05-07T20:33:01.1905058Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1905198Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1905278Z ) 2025-05-07T20:33:01.1905357Z else: 2025-05-07T20:33:01.1905455Z scale_ub_tensor = None 2025-05-07T20:33:01.1905526Z 2025-05-07T20:33:01.1905654Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1905751Z op = silu_mul_quant 2025-05-07T20:33:01.1905833Z if compiled: 2025-05-07T20:33:01.1905932Z op = torch.compile(op) 2025-05-07T20:33:01.1906042Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1906111Z 2025-05-07T20:33:01.1906204Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1906214Z 2025-05-07T20:33:01.1906309Z moe/activation_test.py:117: 2025-05-07T20:33:01.1906436Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1906543Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1906641Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1907067Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:01.1907168Z return fn(*args, **kwargs) 
2025-05-07T20:33:01.1907805Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.1907899Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1908261Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1908480Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1908821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1908913Z kernel = self.compile( 2025-05-07T20:33:01.1909297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1909480Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1909701Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1909707Z 2025-05-07T20:33:01.1909914Z self = 2025-05-07T20:33:01.1910680Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1911172Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d74d4ea0>} 2025-05-07T20:33:01.1911916Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1912179Z context = 2025-05-07T20:33:01.1912187Z 2025-05-07T20:33:01.1912355Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1912611Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1912717Z module_map=module_map) 2025-05-07T20:33:01.1912880Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1912974Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1913053Z E ^ 2025-05-07T20:33:01.1913403Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1913408Z 2025-05-07T20:33:01.1913826Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1913831Z 2025-05-07T20:33:01.1913938Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1914161Z self=, 2025-05-07T20:33:01.1914244Z T=2048, 2025-05-07T20:33:01.1914322Z D=7168, 2025-05-07T20:33:01.1914401Z scale_ub=None, 2025-05-07T20:33:01.1914490Z contiguous=True, 2025-05-07T20:33:01.1914572Z compiled=True, 2025-05-07T20:33:01.1914646Z ) 2025-05-07T20:33:01.1914864Z self = 2025-05-07T20:33:01.1915032Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:01.1915037Z 2025-05-07T20:33:01.1915114Z @given( 2025-05-07T20:33:01.1915237Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1915335Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1915449Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1915571Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1915725Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1915809Z ) 2025-05-07T20:33:01.1916053Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1916144Z def test_silu_mul_quant( 2025-05-07T20:33:01.1916228Z self, 2025-05-07T20:33:01.1916305Z T: int, 2025-05-07T20:33:01.1916379Z D: int, 2025-05-07T20:33:01.1916479Z scale_ub: Optional[float], 2025-05-07T20:33:01.1916569Z contiguous: bool, 2025-05-07T20:33:01.1916653Z compiled: bool, 2025-05-07T20:33:01.1916737Z ) -> None: 2025-05-07T20:33:01.1916829Z torch.manual_seed(2025) 2025-05-07T20:33:01.1916901Z 2025-05-07T20:33:01.1917073Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1917149Z 2025-05-07T20:33:01.1917244Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1917366Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1917458Z x = x_sign * x_clamp 2025-05-07T20:33:01.1917549Z x0 = x[:, :D] 2025-05-07T20:33:01.1917717Z x1 = x[:, D:] 2025-05-07T20:33:01.1917790Z 2025-05-07T20:33:01.1917879Z if contiguous: 2025-05-07T20:33:01.1917968Z x0 = x0.contiguous() 2025-05-07T20:33:01.1918055Z x1 = x1.contiguous() 2025-05-07T20:33:01.1918134Z 2025-05-07T20:33:01.1918221Z if scale_ub is not None: 2025-05-07T20:33:01.1918327Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1918465Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1918542Z ) 2025-05-07T20:33:01.1918623Z else: 2025-05-07T20:33:01.1918717Z scale_ub_tensor = None 2025-05-07T20:33:01.1918790Z 2025-05-07T20:33:01.1918923Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1919056Z op = silu_mul_quant 2025-05-07T20:33:01.1919141Z if compiled: 2025-05-07T20:33:01.1919248Z op = torch.compile(op) 2025-05-07T20:33:01.1919352Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1919429Z 2025-05-07T20:33:01.1919524Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1919529Z 2025-05-07T20:33:01.1919622Z moe/activation_test.py:117: 2025-05-07T20:33:01.1919754Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1919850Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1919947Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1920319Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:01.1920409Z return fn(*args, **kwargs) 
2025-05-07T20:33:01.1920893Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.1920998Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1921355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1921582Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1921918Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1922010Z kernel = self.compile( 2025-05-07T20:33:01.1922393Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1922566Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1922689Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1922693Z 2025-05-07T20:33:01.1922897Z self = 2025-05-07T20:33:01.1923709Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1924214Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d74d5c60>} 2025-05-07T20:33:01.1924945Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1925140Z context = 2025-05-07T20:33:01.1925144Z 2025-05-07T20:33:01.1925304Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1925560Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1925676Z module_map=module_map) 2025-05-07T20:33:01.1925837Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1926017Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1926102Z E ^ 2025-05-07T20:33:01.1926450Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1926454Z 2025-05-07T20:33:01.1926894Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1926898Z 2025-05-07T20:33:01.1926998Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1927216Z self=, 2025-05-07T20:33:01.1927299Z T=16384, 2025-05-07T20:33:01.1927372Z D=5120, 2025-05-07T20:33:01.1927452Z scale_ub=None, 2025-05-07T20:33:01.1927583Z contiguous=False, 2025-05-07T20:33:01.1927664Z compiled=False, 2025-05-07T20:33:01.1927742Z ) 2025-05-07T20:33:01.1927959Z self = 2025-05-07T20:33:01.1928138Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:01.1928143Z 2025-05-07T20:33:01.1928222Z @given( 2025-05-07T20:33:01.1928336Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1928431Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1928547Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1928660Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1928777Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1928849Z ) 2025-05-07T20:33:01.1929088Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1929180Z def test_silu_mul_quant( 2025-05-07T20:33:01.1929259Z self, 2025-05-07T20:33:01.1929333Z T: int, 2025-05-07T20:33:01.1929412Z D: int, 2025-05-07T20:33:01.1929507Z scale_ub: Optional[float], 2025-05-07T20:33:01.1929597Z contiguous: bool, 2025-05-07T20:33:01.1929687Z compiled: bool, 2025-05-07T20:33:01.1929766Z ) -> None: 2025-05-07T20:33:01.1929859Z torch.manual_seed(2025) 2025-05-07T20:33:01.1929937Z 2025-05-07T20:33:01.1930102Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1930176Z 2025-05-07T20:33:01.1930270Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1930393Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1932286Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.1932299Z 2025-05-07T20:33:01.1932416Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:01.1932421Z 2025-05-07T20:33:01.1932526Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1932746Z self=, 2025-05-07T20:33:01.1932826Z T=4096, 2025-05-07T20:33:01.1932909Z D=7168, 2025-05-07T20:33:01.1932989Z scale_ub=1200.0, 2025-05-07T20:33:01.1933071Z contiguous=True, 2025-05-07T20:33:01.1933158Z compiled=True, 2025-05-07T20:33:01.1933228Z ) 2025-05-07T20:33:01.1933442Z self = 2025-05-07T20:33:01.1933613Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:01.1933620Z 2025-05-07T20:33:01.1933694Z @given( 2025-05-07T20:33:01.1933814Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1933913Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1934112Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1934234Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1934347Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1934422Z ) 2025-05-07T20:33:01.1934669Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1934758Z def test_silu_mul_quant( 2025-05-07T20:33:01.1934842Z self, 2025-05-07T20:33:01.1934917Z T: int, 2025-05-07T20:33:01.1934993Z D: int, 2025-05-07T20:33:01.1935097Z scale_ub: Optional[float], 2025-05-07T20:33:01.1935188Z contiguous: bool, 2025-05-07T20:33:01.1935269Z compiled: bool, 2025-05-07T20:33:01.1935395Z ) -> None: 2025-05-07T20:33:01.1935487Z torch.manual_seed(2025) 2025-05-07T20:33:01.1935560Z 2025-05-07T20:33:01.1935734Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1935808Z 2025-05-07T20:33:01.1935900Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1936027Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1937790Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
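The allocator hint in the message above is an environment setting, not a code change. A minimal sketch of applying it (assumption: the variable must be exported before the process makes its first CUDA allocation, so in a fresh run, before torch touches the GPU):

    # Sketch: apply the allocator hint quoted in the OOM message above.
    # PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator starts up,
    # so it has to be set before anything allocates on the device.
    import os
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch  # imported after the variable is set so it takes effect

    x = torch.randn([2048, 2 * 7168], device="cuda", dtype=torch.bfloat16)

In CI the same effect is usually achieved by exporting the variable in the job environment before pytest starts.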
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.1937798Z 2025-05-07T20:33:01.1937918Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:01.1937922Z 2025-05-07T20:33:01.1938022Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1938248Z self=, 2025-05-07T20:33:01.1938326Z T=16384, 2025-05-07T20:33:01.1938401Z D=7168, 2025-05-07T20:33:01.1938490Z scale_ub=None, 2025-05-07T20:33:01.1938573Z contiguous=False, 2025-05-07T20:33:01.1938655Z compiled=False, 2025-05-07T20:33:01.1938731Z ) 2025-05-07T20:33:01.1938942Z self = 2025-05-07T20:33:01.1939113Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:01.1939117Z 2025-05-07T20:33:01.1939196Z @given( 2025-05-07T20:33:01.1939310Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1939411Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1939521Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1939637Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1939798Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1939871Z ) 2025-05-07T20:33:01.1940649Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1940796Z def test_silu_mul_quant( 2025-05-07T20:33:01.1940874Z self, 2025-05-07T20:33:01.1940949Z T: int, 2025-05-07T20:33:01.1941030Z D: int, 2025-05-07T20:33:01.1941124Z scale_ub: Optional[float], 2025-05-07T20:33:01.1941216Z contiguous: bool, 2025-05-07T20:33:01.1941300Z compiled: bool, 2025-05-07T20:33:01.1941377Z ) -> None: 2025-05-07T20:33:01.1941476Z torch.manual_seed(2025) 2025-05-07T20:33:01.1941547Z 2025-05-07T20:33:01.1941710Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1943647Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.1943743Z 2025-05-07T20:33:01.1943859Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:01.1943864Z 2025-05-07T20:33:01.1943970Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1944186Z self=, 2025-05-07T20:33:01.1944260Z T=2048, 2025-05-07T20:33:01.1944341Z D=7168, 2025-05-07T20:33:01.1944425Z scale_ub=1200.0, 2025-05-07T20:33:01.1944510Z contiguous=True, 2025-05-07T20:33:01.1944666Z compiled=True, 2025-05-07T20:33:01.1944739Z ) 2025-05-07T20:33:01.1944956Z self = 2025-05-07T20:33:01.1945128Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:01.1945135Z 2025-05-07T20:33:01.1945210Z @given( 2025-05-07T20:33:01.1945332Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1945427Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1945538Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1945659Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1945769Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1945844Z ) 2025-05-07T20:33:01.1946083Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1946174Z def test_silu_mul_quant( 2025-05-07T20:33:01.1946259Z self, 2025-05-07T20:33:01.1946338Z T: int, 2025-05-07T20:33:01.1946413Z D: int, 2025-05-07T20:33:01.1946516Z scale_ub: Optional[float], 2025-05-07T20:33:01.1946607Z contiguous: bool, 2025-05-07T20:33:01.1946689Z compiled: bool, 2025-05-07T20:33:01.1946777Z ) -> None: 2025-05-07T20:33:01.1946869Z torch.manual_seed(2025) 2025-05-07T20:33:01.1946938Z 2025-05-07T20:33:01.1947107Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1947182Z 2025-05-07T20:33:01.1947272Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1947463Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1949283Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.1949300Z 2025-05-07T20:33:01.1949416Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:01.1949421Z 2025-05-07T20:33:01.1949521Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1949747Z self=, 2025-05-07T20:33:01.1949822Z T=2048, 2025-05-07T20:33:01.1949898Z D=7168, 2025-05-07T20:33:01.1949984Z scale_ub=None, 2025-05-07T20:33:01.1950066Z contiguous=True, 2025-05-07T20:33:01.1950151Z compiled=False, 2025-05-07T20:33:01.1950230Z ) 2025-05-07T20:33:01.1950440Z self = 2025-05-07T20:33:01.1950612Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:01.1950618Z 2025-05-07T20:33:01.1950692Z @given( 2025-05-07T20:33:01.1950804Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1950906Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1951108Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1951222Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1951340Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1951413Z ) 2025-05-07T20:33:01.1951651Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1951747Z def test_silu_mul_quant( 2025-05-07T20:33:01.1951822Z self, 2025-05-07T20:33:01.1951906Z T: int, 2025-05-07T20:33:01.1951977Z D: int, 2025-05-07T20:33:01.1952070Z scale_ub: Optional[float], 2025-05-07T20:33:01.1952159Z contiguous: bool, 2025-05-07T20:33:01.1952241Z compiled: bool, 2025-05-07T20:33:01.1952317Z ) -> None: 2025-05-07T20:33:01.1952455Z torch.manual_seed(2025) 2025-05-07T20:33:01.1952525Z 2025-05-07T20:33:01.1952687Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1952769Z 2025-05-07T20:33:01.1952860Z > x_sign = torch.sign(x) 2025-05-07T20:33:01.1954606Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
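Across the successive Trying example runs above, free memory stays near zero (140.44 MiB, then 28.44 MiB, then 26.44 MiB free on a 22.07 GiB card), so each new example inherits whatever the previous ones left cached. One hedged way to reset between examples (_release_cuda_memory is an invented helper, not part of activation_test.py):

    # Hypothetical per-example cleanup; not what the test currently does.
    import gc
    import torch

    def _release_cuda_memory() -> None:
        gc.collect()              # drop dead Python references to tensors first
        torch.cuda.synchronize()  # let pending kernels finish before freeing
        torch.cuda.empty_cache()  # hand cached, unused blocks back to the driver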
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.1954612Z 2025-05-07T20:33:01.1954725Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:01.1954732Z 2025-05-07T20:33:01.1954829Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1955050Z self=, 2025-05-07T20:33:01.1955127Z T=1, 2025-05-07T20:33:01.1955203Z D=7168, 2025-05-07T20:33:01.1955285Z scale_ub=1200.0, 2025-05-07T20:33:01.1955367Z contiguous=True, 2025-05-07T20:33:01.1955455Z compiled=False, 2025-05-07T20:33:01.1955524Z ) 2025-05-07T20:33:01.1955735Z self = 2025-05-07T20:33:01.1955901Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:01.1955905Z 2025-05-07T20:33:01.1955977Z @given( 2025-05-07T20:33:01.1956089Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1956190Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1956300Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1956419Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1956529Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1956602Z ) 2025-05-07T20:33:01.1956981Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1957078Z def test_silu_mul_quant( 2025-05-07T20:33:01.1957152Z self, 2025-05-07T20:33:01.1957228Z T: int, 2025-05-07T20:33:01.1957300Z D: int, 2025-05-07T20:33:01.1957392Z scale_ub: Optional[float], 2025-05-07T20:33:01.1957478Z contiguous: bool, 2025-05-07T20:33:01.1957559Z compiled: bool, 2025-05-07T20:33:01.1957631Z ) -> None: 2025-05-07T20:33:01.1957725Z torch.manual_seed(2025) 2025-05-07T20:33:01.1957793Z 2025-05-07T20:33:01.1957955Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1958024Z 2025-05-07T20:33:01.1958113Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1958241Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1958329Z x = x_sign * x_clamp 2025-05-07T20:33:01.1958404Z x0 = x[:, :D] 2025-05-07T20:33:01.1958485Z x1 = x[:, D:] 2025-05-07T20:33:01.1958553Z 2025-05-07T20:33:01.1958636Z if contiguous: 2025-05-07T20:33:01.1958815Z x0 = x0.contiguous() 2025-05-07T20:33:01.1958905Z x1 = x1.contiguous() 2025-05-07T20:33:01.1958975Z 2025-05-07T20:33:01.1959070Z if scale_ub is not None: 2025-05-07T20:33:01.1959174Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1959310Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1959384Z ) 2025-05-07T20:33:01.1959461Z else: 2025-05-07T20:33:01.1959561Z scale_ub_tensor = None 2025-05-07T20:33:01.1959646Z 2025-05-07T20:33:01.1959773Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1959869Z op = silu_mul_quant 2025-05-07T20:33:01.1959951Z if compiled: 2025-05-07T20:33:01.1960585Z op = torch.compile(op) 2025-05-07T20:33:01.1960696Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1960772Z 2025-05-07T20:33:01.1960864Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1960869Z 2025-05-07T20:33:01.1960974Z moe/activation_test.py:117: 2025-05-07T20:33:01.1961100Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1961202Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1961299Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1961847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.1961946Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1962308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1962524Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1962875Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1962973Z kernel = self.compile( 2025-05-07T20:33:01.1963362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1963535Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1963657Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1963662Z 2025-05-07T20:33:01.1963873Z self = 2025-05-07T20:33:01.1964636Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1965136Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d7504b80>} 2025-05-07T20:33:01.1965920Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1966115Z context = 2025-05-07T20:33:01.1966120Z 2025-05-07T20:33:01.1966281Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1966538Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1966650Z module_map=module_map) 2025-05-07T20:33:01.1966809Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1966902Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1966986Z E ^ 2025-05-07T20:33:01.1967336Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1967341Z 2025-05-07T20:33:01.1967813Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1967854Z 2025-05-07T20:33:01.1967955Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1968173Z self=, 2025-05-07T20:33:01.1968259Z T=128, 2025-05-07T20:33:01.1968332Z D=5120, 2025-05-07T20:33:01.1968411Z scale_ub=None, 2025-05-07T20:33:01.1968498Z contiguous=True, 2025-05-07T20:33:01.1968578Z compiled=False, 2025-05-07T20:33:01.1968649Z ) 2025-05-07T20:33:01.1968868Z self = 2025-05-07T20:33:01.1969036Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:01.1969041Z 2025-05-07T20:33:01.1969156Z @given( 2025-05-07T20:33:01.1969270Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1969369Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1969486Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1969602Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1969713Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1969791Z ) 2025-05-07T20:33:01.1970030Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1970126Z def test_silu_mul_quant( 2025-05-07T20:33:01.1970205Z self, 2025-05-07T20:33:01.1970279Z T: int, 2025-05-07T20:33:01.1970359Z D: int, 2025-05-07T20:33:01.1970451Z scale_ub: Optional[float], 2025-05-07T20:33:01.1970537Z contiguous: bool, 2025-05-07T20:33:01.1970627Z compiled: bool, 2025-05-07T20:33:01.1970702Z ) -> None: 2025-05-07T20:33:01.1970793Z torch.manual_seed(2025) 2025-05-07T20:33:01.1970875Z 2025-05-07T20:33:01.1971037Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1971107Z 2025-05-07T20:33:01.1971211Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1971378Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1971513Z x = x_sign * x_clamp 2025-05-07T20:33:01.1971645Z x0 = x[:, :D] 2025-05-07T20:33:01.1971737Z x1 = x[:, D:] 2025-05-07T20:33:01.1971827Z 2025-05-07T20:33:01.1971907Z if contiguous: 2025-05-07T20:33:01.1971994Z x0 = x0.contiguous() 2025-05-07T20:33:01.1972122Z x1 = x1.contiguous() 2025-05-07T20:33:01.1972221Z 2025-05-07T20:33:01.1972328Z if scale_ub is not None: 2025-05-07T20:33:01.1972438Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1972570Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1972645Z ) 2025-05-07T20:33:01.1972726Z else: 2025-05-07T20:33:01.1972818Z scale_ub_tensor = None 2025-05-07T20:33:01.1972892Z 2025-05-07T20:33:01.1973100Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1973188Z op = silu_mul_quant 2025-05-07T20:33:01.1973281Z if compiled: 2025-05-07T20:33:01.1973377Z op = torch.compile(op) 2025-05-07T20:33:01.1973478Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1973554Z 2025-05-07T20:33:01.1973645Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1973650Z 2025-05-07T20:33:01.1973743Z moe/activation_test.py:117: 2025-05-07T20:33:01.1973875Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1973973Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1974068Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1974566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.1974664Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1975033Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1975382Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1975721Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1975820Z kernel = self.compile( 2025-05-07T20:33:01.1976199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1976376Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1976500Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1976504Z 2025-05-07T20:33:01.1976702Z self = 2025-05-07T20:33:01.1977518Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1978015Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d7505a80>} 2025-05-07T20:33:01.1978753Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1978938Z context = 2025-05-07T20:33:01.1978943Z 2025-05-07T20:33:01.1979102Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1979367Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1979473Z module_map=module_map) 2025-05-07T20:33:01.1979640Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1979737Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1979816Z E ^ 2025-05-07T20:33:01.1980170Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1980175Z 2025-05-07T20:33:01.1980585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1980590Z 2025-05-07T20:33:01.1980694Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1980911Z self=, 2025-05-07T20:33:01.1980988Z T=128, 2025-05-07T20:33:01.1981066Z D=7168, 2025-05-07T20:33:01.1981144Z scale_ub=None, 2025-05-07T20:33:01.1981228Z contiguous=True, 2025-05-07T20:33:01.1981318Z compiled=False, 2025-05-07T20:33:01.1981386Z ) 2025-05-07T20:33:01.1981643Z self = 2025-05-07T20:33:01.1981821Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:01.1981827Z 2025-05-07T20:33:01.1981898Z @given( 2025-05-07T20:33:01.1982025Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1982120Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1982234Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1982351Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1982461Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1982530Z ) 2025-05-07T20:33:01.1982771Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1982861Z def test_silu_mul_quant( 2025-05-07T20:33:01.1982934Z self, 2025-05-07T20:33:01.1983014Z T: int, 2025-05-07T20:33:01.1983088Z D: int, 2025-05-07T20:33:01.1983180Z scale_ub: Optional[float], 2025-05-07T20:33:01.1983278Z contiguous: bool, 2025-05-07T20:33:01.1983359Z compiled: bool, 2025-05-07T20:33:01.1983522Z ) -> None: 2025-05-07T20:33:01.1983619Z torch.manual_seed(2025) 2025-05-07T20:33:01.1983686Z 2025-05-07T20:33:01.1983854Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1983929Z 2025-05-07T20:33:01.1984016Z x_sign = torch.sign(x) 2025-05-07T20:33:01.1984142Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.1984227Z x = x_sign * x_clamp 2025-05-07T20:33:01.1984304Z x0 = x[:, :D] 2025-05-07T20:33:01.1984385Z x1 = x[:, D:] 2025-05-07T20:33:01.1984454Z 2025-05-07T20:33:01.1984533Z if contiguous: 2025-05-07T20:33:01.1984628Z x0 = x0.contiguous() 2025-05-07T20:33:01.1984713Z x1 = x1.contiguous() 2025-05-07T20:33:01.1984831Z 2025-05-07T20:33:01.1984919Z if scale_ub is not None: 2025-05-07T20:33:01.1985022Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.1985166Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.1985245Z ) 2025-05-07T20:33:01.1985318Z else: 2025-05-07T20:33:01.1985413Z scale_ub_tensor = None 2025-05-07T20:33:01.1985483Z 2025-05-07T20:33:01.1985610Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.1985700Z op = silu_mul_quant 2025-05-07T20:33:01.1985780Z if compiled: 2025-05-07T20:33:01.1985875Z op = torch.compile(op) 2025-05-07T20:33:01.1985982Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1986053Z 2025-05-07T20:33:01.1986141Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.1986151Z 2025-05-07T20:33:01.1986244Z moe/activation_test.py:117: 2025-05-07T20:33:01.1986368Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1986474Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.1986571Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.1987060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.1987162Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.1987641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.1987863Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.1988200Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.1988293Z kernel = self.compile( 2025-05-07T20:33:01.1988680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.1988855Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.1989034Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.1989039Z 2025-05-07T20:33:01.1989247Z self = 2025-05-07T20:33:01.1990008Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.1990506Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d7506980>} 2025-05-07T20:33:01.1991237Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.1991434Z context = 2025-05-07T20:33:01.1991439Z 2025-05-07T20:33:01.1991626Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.1991992Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.1992105Z module_map=module_map) 2025-05-07T20:33:01.1992263Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.1992356Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.1992439Z E ^ 2025-05-07T20:33:01.1992787Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.1992791Z 2025-05-07T20:33:01.1993208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.1993253Z 2025-05-07T20:33:01.1993354Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1993572Z self=, 2025-05-07T20:33:01.1993659Z T=2048, 2025-05-07T20:33:01.1993738Z D=7168, 2025-05-07T20:33:01.1993821Z scale_ub=1200.0, 2025-05-07T20:33:01.1993911Z contiguous=True, 2025-05-07T20:33:01.1993993Z compiled=False, 2025-05-07T20:33:01.1994070Z ) 2025-05-07T20:33:01.1994284Z self = 2025-05-07T20:33:01.1994455Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:01.1994460Z 2025-05-07T20:33:01.1994543Z @given( 2025-05-07T20:33:01.1994657Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1994756Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.1994873Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.1994987Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.1995108Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.1995180Z ) 2025-05-07T20:33:01.1995421Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.1995523Z def test_silu_mul_quant( 2025-05-07T20:33:01.1995596Z self, 2025-05-07T20:33:01.1995671Z T: int, 2025-05-07T20:33:01.1995752Z D: int, 2025-05-07T20:33:01.1995846Z scale_ub: Optional[float], 2025-05-07T20:33:01.1995936Z contiguous: bool, 2025-05-07T20:33:01.1996023Z compiled: bool, 2025-05-07T20:33:01.1996099Z ) -> None: 2025-05-07T20:33:01.1996192Z torch.manual_seed(2025) 2025-05-07T20:33:01.1996270Z 2025-05-07T20:33:01.1996432Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.1998246Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.1998258Z 2025-05-07T20:33:01.1998373Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:01.1998378Z 2025-05-07T20:33:01.1998482Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.1998701Z self=, 2025-05-07T20:33:01.1998774Z T=1, 2025-05-07T20:33:01.1998848Z D=5120, 2025-05-07T20:33:01.1998925Z scale_ub=1200.0, 2025-05-07T20:33:01.1999004Z contiguous=True, 2025-05-07T20:33:01.1999088Z compiled=False, 2025-05-07T20:33:01.1999158Z ) 2025-05-07T20:33:01.1999370Z self = 2025-05-07T20:33:01.1999539Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:01.1999546Z 2025-05-07T20:33:01.1999618Z @given( 2025-05-07T20:33:01.1999825Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.1999920Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.2000030Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.2000148Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.2000258Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.2000333Z ) 2025-05-07T20:33:01.2000574Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.2000664Z def test_silu_mul_quant( 2025-05-07T20:33:01.2000736Z self, 2025-05-07T20:33:01.2000814Z T: int, 2025-05-07T20:33:01.2000889Z D: int, 2025-05-07T20:33:01.2000991Z scale_ub: Optional[float], 2025-05-07T20:33:01.2001142Z contiguous: bool, 2025-05-07T20:33:01.2001224Z compiled: bool, 2025-05-07T20:33:01.2001307Z ) -> None: 2025-05-07T20:33:01.2001403Z torch.manual_seed(2025) 2025-05-07T20:33:01.2001484Z 2025-05-07T20:33:01.2001682Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.2001768Z 2025-05-07T20:33:01.2001857Z x_sign = torch.sign(x) 2025-05-07T20:33:01.2001985Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.2002071Z x = x_sign * x_clamp 2025-05-07T20:33:01.2002148Z x0 = x[:, :D] 2025-05-07T20:33:01.2002230Z x1 = x[:, D:] 2025-05-07T20:33:01.2002304Z 2025-05-07T20:33:01.2002384Z if contiguous: 2025-05-07T20:33:01.2002475Z x0 = x0.contiguous() 2025-05-07T20:33:01.2002559Z x1 = x1.contiguous() 2025-05-07T20:33:01.2002638Z 2025-05-07T20:33:01.2002726Z if scale_ub is not None: 2025-05-07T20:33:01.2002830Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.2002968Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.2003044Z ) 2025-05-07T20:33:01.2003122Z else: 2025-05-07T20:33:01.2003226Z scale_ub_tensor = None 2025-05-07T20:33:01.2003296Z 2025-05-07T20:33:01.2003422Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.2003519Z op = silu_mul_quant 2025-05-07T20:33:01.2003601Z if compiled: 2025-05-07T20:33:01.2003698Z op = torch.compile(op) 2025-05-07T20:33:01.2003807Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.2003878Z 2025-05-07T20:33:01.2003971Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.2003976Z 2025-05-07T20:33:01.2004071Z moe/activation_test.py:117: 2025-05-07T20:33:01.2004197Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.2004299Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.2004398Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.2004968Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.2005073Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.2005429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.2005656Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.2005989Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.2006082Z kernel = self.compile( 2025-05-07T20:33:01.2006485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.2006654Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.2006782Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.2006792Z 2025-05-07T20:33:01.2006994Z self = 2025-05-07T20:33:01.2007801Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.2008337Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d7507e20>} 2025-05-07T20:33:01.2009069Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.2009263Z context = 2025-05-07T20:33:01.2009305Z 2025-05-07T20:33:01.2009465Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.2009725Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.2009842Z module_map=module_map) 2025-05-07T20:33:01.2010001Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.2010102Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.2010179Z E ^ 2025-05-07T20:33:01.2010527Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.2010532Z 2025-05-07T20:33:01.2010946Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.2010950Z 2025-05-07T20:33:01.2011049Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.2011265Z self=, 2025-05-07T20:33:01.2011352Z T=2048, 2025-05-07T20:33:01.2011427Z D=5120, 2025-05-07T20:33:01.2011510Z scale_ub=None, 2025-05-07T20:33:01.2011594Z contiguous=True, 2025-05-07T20:33:01.2011676Z compiled=False, 2025-05-07T20:33:01.2011754Z ) 2025-05-07T20:33:01.2011967Z self = 2025-05-07T20:33:01.2012137Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:01.2012142Z 2025-05-07T20:33:01.2012223Z @given( 2025-05-07T20:33:01.2012337Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.2012436Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.2012557Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.2012670Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.2012782Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.2012854Z ) 2025-05-07T20:33:01.2013095Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.2013192Z def test_silu_mul_quant( 2025-05-07T20:33:01.2013314Z self, 2025-05-07T20:33:01.2013390Z T: int, 2025-05-07T20:33:01.2013479Z D: int, 2025-05-07T20:33:01.2013575Z scale_ub: Optional[float], 2025-05-07T20:33:01.2013665Z contiguous: bool, 2025-05-07T20:33:01.2013753Z compiled: bool, 2025-05-07T20:33:01.2013829Z ) -> None: 2025-05-07T20:33:01.2013922Z torch.manual_seed(2025) 2025-05-07T20:33:01.2014001Z 2025-05-07T20:33:01.2014165Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.2014243Z 2025-05-07T20:33:01.2014331Z > x_sign = torch.sign(x) 2025-05-07T20:33:01.2016153Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.2016206Z 2025-05-07T20:33:01.2016322Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:01.2016327Z 2025-05-07T20:33:01.2016425Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.2016649Z self=, 2025-05-07T20:33:01.2016724Z T=16384, 2025-05-07T20:33:01.2016797Z D=5120, 2025-05-07T20:33:01.2016882Z scale_ub=None, 2025-05-07T20:33:01.2016964Z contiguous=True, 2025-05-07T20:33:01.2017044Z compiled=False, 2025-05-07T20:33:01.2017122Z ) 2025-05-07T20:33:01.2017334Z self = 2025-05-07T20:33:01.2017553Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:01.2017557Z 2025-05-07T20:33:01.2017633Z @given( 2025-05-07T20:33:01.2017749Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.2017852Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.2017963Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.2018077Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.2018191Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.2018264Z ) 2025-05-07T20:33:01.2018503Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.2018599Z def test_silu_mul_quant( 2025-05-07T20:33:01.2018671Z self, 2025-05-07T20:33:01.2018750Z T: int, 2025-05-07T20:33:01.2018824Z D: int, 2025-05-07T20:33:01.2018917Z scale_ub: Optional[float], 2025-05-07T20:33:01.2019012Z contiguous: bool, 2025-05-07T20:33:01.2019093Z compiled: bool, 2025-05-07T20:33:01.2019167Z ) -> None: 2025-05-07T20:33:01.2019263Z torch.manual_seed(2025) 2025-05-07T20:33:01.2019335Z 2025-05-07T20:33:01.2019502Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.2021265Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
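The requested sizes are consistent with the test's first allocation: a [T, 2 * D] bfloat16 tensor costs T * 2D * 2 bytes. Worked through for the example above:

    # Size of x = torch.randn([T, 2 * D], dtype=torch.bfloat16) for T=16384, D=5120
    T, D = 16384, 5120
    bytes_needed = T * (2 * D) * 2   # 2 bytes per bfloat16 element
    print(bytes_needed / 2**20)      # 320.0, matching "Tried to allocate 320.00 MiB"

The other reported sizes (448.00 MiB, 112.00 MiB, 80.00 MiB, 56.00 MiB, 40.00 MiB) follow the same formula for the remaining (T, D) pairs in the sampled grid.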
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.2021272Z 2025-05-07T20:33:01.2021386Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:01.2021394Z 2025-05-07T20:33:01.2021498Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.2021758Z self=, 2025-05-07T20:33:01.2021839Z T=4096, 2025-05-07T20:33:01.2026547Z D=5120, 2025-05-07T20:33:01.2026641Z scale_ub=None, 2025-05-07T20:33:01.2026734Z contiguous=True, 2025-05-07T20:33:01.2026819Z compiled=False, 2025-05-07T20:33:01.2026891Z ) 2025-05-07T20:33:01.2027114Z self = 2025-05-07T20:33:01.2027283Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:01.2027289Z 2025-05-07T20:33:01.2027364Z @given( 2025-05-07T20:33:01.2027632Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.2027733Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.2027844Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.2027964Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.2028078Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.2028158Z ) 2025-05-07T20:33:01.2028403Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.2028623Z def test_silu_mul_quant( 2025-05-07T20:33:01.2028705Z self, 2025-05-07T20:33:01.2028781Z T: int, 2025-05-07T20:33:01.2028854Z D: int, 2025-05-07T20:33:01.2028956Z scale_ub: Optional[float], 2025-05-07T20:33:01.2029041Z contiguous: bool, 2025-05-07T20:33:01.2029126Z compiled: bool, 2025-05-07T20:33:01.2029210Z ) -> None: 2025-05-07T20:33:01.2029300Z torch.manual_seed(2025) 2025-05-07T20:33:01.2029372Z 2025-05-07T20:33:01.2029544Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.2031307Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.2031367Z 2025-05-07T20:33:01.2031486Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:01.2031491Z 2025-05-07T20:33:01.2031613Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.2031868Z self=, 2025-05-07T20:33:01.2031939Z T=2048, 2025-05-07T20:33:01.2032013Z D=5120, 2025-05-07T20:33:01.2032108Z scale_ub=None, 2025-05-07T20:33:01.2032192Z contiguous=False, 2025-05-07T20:33:01.2032270Z compiled=False, 2025-05-07T20:33:01.2032351Z ) 2025-05-07T20:33:01.2032565Z self = 2025-05-07T20:33:01.2032741Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:01.2032746Z 2025-05-07T20:33:01.2032824Z @given( 2025-05-07T20:33:01.2032942Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.2033044Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.2033155Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.2033267Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.2033379Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.2033451Z ) 2025-05-07T20:33:01.2033692Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.2033790Z def test_silu_mul_quant( 2025-05-07T20:33:01.2033864Z self, 2025-05-07T20:33:01.2033944Z T: int, 2025-05-07T20:33:01.2034020Z D: int, 2025-05-07T20:33:01.2034119Z scale_ub: Optional[float], 2025-05-07T20:33:01.2034210Z contiguous: bool, 2025-05-07T20:33:01.2034292Z compiled: bool, 2025-05-07T20:33:01.2034414Z ) -> None: 2025-05-07T20:33:01.2034513Z torch.manual_seed(2025) 2025-05-07T20:33:01.2034590Z 2025-05-07T20:33:01.2034753Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.2036507Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.2036515Z 2025-05-07T20:33:01.2036629Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:01.2036633Z 2025-05-07T20:33:01.2036738Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.2037007Z self=, 2025-05-07T20:33:01.2037126Z T=4096, 2025-05-07T20:33:01.2037200Z D=7168, 2025-05-07T20:33:01.2037278Z scale_ub=None, 2025-05-07T20:33:01.2037369Z contiguous=True, 2025-05-07T20:33:01.2037450Z compiled=True, 2025-05-07T20:33:01.2037520Z ) 2025-05-07T20:33:01.2037741Z self = 2025-05-07T20:33:01.2037905Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:01.2037910Z 2025-05-07T20:33:01.2037985Z @given( 2025-05-07T20:33:01.2038105Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.2038200Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.2038309Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.2038508Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.2038623Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.2038706Z ) 2025-05-07T20:33:01.2038952Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.2039043Z def test_silu_mul_quant( 2025-05-07T20:33:01.2039124Z self, 2025-05-07T20:33:01.2039198Z T: int, 2025-05-07T20:33:01.2039273Z D: int, 2025-05-07T20:33:01.2039374Z scale_ub: Optional[float], 2025-05-07T20:33:01.2039459Z contiguous: bool, 2025-05-07T20:33:01.2039542Z compiled: bool, 2025-05-07T20:33:01.2039623Z ) -> None: 2025-05-07T20:33:01.2039713Z torch.manual_seed(2025) 2025-05-07T20:33:01.2039783Z 2025-05-07T20:33:01.2039953Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.2042057Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
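If the goal were merely to keep the property-based search alive on a 22 GiB card, one option is to treat OOM as an invalid example rather than a failure. This is a sketch of an alternative, not what activation_test.py does, and it would also mask genuine regressions in peak memory:

    # Hedged sketch: discard OOM examples instead of failing the test.
    from hypothesis import assume
    import torch

    try:
        y_fp8, y_scale = fn()
    except torch.OutOfMemoryError:
        assume(False)  # Hypothesis rejects the example and samples another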
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.2042076Z 2025-05-07T20:33:01.2042193Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:01.2042198Z 2025-05-07T20:33:01.2042297Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.2042523Z self=, 2025-05-07T20:33:01.2042600Z T=2048, 2025-05-07T20:33:01.2042673Z D=5120, 2025-05-07T20:33:01.2042758Z scale_ub=1200.0, 2025-05-07T20:33:01.2042839Z contiguous=False, 2025-05-07T20:33:01.2042924Z compiled=False, 2025-05-07T20:33:01.2043002Z ) 2025-05-07T20:33:01.2043220Z self = 2025-05-07T20:33:01.2043550Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:01.2043557Z 2025-05-07T20:33:01.2043634Z @given( 2025-05-07T20:33:01.2043751Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.2043855Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.2043968Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.2044085Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.2044201Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.2044274Z ) 2025-05-07T20:33:01.2044512Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.2044610Z def test_silu_mul_quant( 2025-05-07T20:33:01.2044685Z self, 2025-05-07T20:33:01.2044768Z T: int, 2025-05-07T20:33:01.2044844Z D: int, 2025-05-07T20:33:01.2044941Z scale_ub: Optional[float], 2025-05-07T20:33:01.2045039Z contiguous: bool, 2025-05-07T20:33:01.2045127Z compiled: bool, 2025-05-07T20:33:01.2045326Z ) -> None: 2025-05-07T20:33:01.2045429Z torch.manual_seed(2025) 2025-05-07T20:33:01.2045501Z 2025-05-07T20:33:01.2045673Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.2047428Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.2047498Z 2025-05-07T20:33:01.2047611Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:01.2047618Z 2025-05-07T20:33:01.2047725Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.2047947Z self=, 2025-05-07T20:33:01.2048031Z T=4096, 2025-05-07T20:33:01.2048104Z D=7168, 2025-05-07T20:33:01.2048186Z scale_ub=1200.0, 2025-05-07T20:33:01.2048272Z contiguous=True, 2025-05-07T20:33:01.2048354Z compiled=False, 2025-05-07T20:33:01.2048430Z ) 2025-05-07T20:33:01.2048650Z self = 2025-05-07T20:33:01.2048816Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:01.2048821Z 2025-05-07T20:33:01.2048896Z @given( 2025-05-07T20:33:01.2049014Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.2049111Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.2049220Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.2049340Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.2049449Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.2049536Z ) 2025-05-07T20:33:01.2049776Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.2049866Z def test_silu_mul_quant( 2025-05-07T20:33:01.2049950Z self, 2025-05-07T20:33:01.2050028Z T: int, 2025-05-07T20:33:01.2050101Z D: int, 2025-05-07T20:33:01.2050200Z scale_ub: Optional[float], 2025-05-07T20:33:01.2050286Z contiguous: bool, 2025-05-07T20:33:01.2050367Z compiled: bool, 2025-05-07T20:33:01.2050449Z ) -> None: 2025-05-07T20:33:01.2050541Z torch.manual_seed(2025) 2025-05-07T20:33:01.2050612Z 2025-05-07T20:33:01.2050779Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.2052585Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.2052599Z 2025-05-07T20:33:01.2052712Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:01.2052716Z 2025-05-07T20:33:01.2052821Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.2053047Z self=, 2025-05-07T20:33:01.2053126Z T=16384, 2025-05-07T20:33:01.2053204Z D=7168, 2025-05-07T20:33:01.2053293Z scale_ub=None, 2025-05-07T20:33:01.2053383Z contiguous=False, 2025-05-07T20:33:01.2053464Z compiled=True, 2025-05-07T20:33:01.2053544Z ) 2025-05-07T20:33:01.2053761Z self = 2025-05-07T20:33:01.2054024Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:01.2054029Z 2025-05-07T20:33:01.2054108Z @given( 2025-05-07T20:33:01.2054224Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.2054326Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.2054440Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.2054554Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.2054671Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.2054748Z ) 2025-05-07T20:33:01.2054988Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.2055085Z def test_silu_mul_quant( 2025-05-07T20:33:01.2055205Z self, 2025-05-07T20:33:01.2055285Z T: int, 2025-05-07T20:33:01.2055361Z D: int, 2025-05-07T20:33:01.2055464Z scale_ub: Optional[float], 2025-05-07T20:33:01.2055562Z contiguous: bool, 2025-05-07T20:33:01.2055650Z compiled: bool, 2025-05-07T20:33:01.2055725Z ) -> None: 2025-05-07T20:33:01.2055823Z torch.manual_seed(2025) 2025-05-07T20:33:01.2055897Z 2025-05-07T20:33:01.2056063Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.2057828Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.2057836Z 2025-05-07T20:33:01.2057954Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:01.2057961Z 2025-05-07T20:33:01.2058069Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.2058288Z self=, 2025-05-07T20:33:01.2058374Z T=4096, 2025-05-07T20:33:01.2058454Z D=7168, 2025-05-07T20:33:01.2058535Z scale_ub=None, 2025-05-07T20:33:01.2058625Z contiguous=True, 2025-05-07T20:33:01.2058710Z compiled=False, 2025-05-07T20:33:01.2058785Z ) 2025-05-07T20:33:01.2059006Z self = 2025-05-07T20:33:01.2059174Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:01.2059178Z 2025-05-07T20:33:01.2059258Z @given( 2025-05-07T20:33:01.2059386Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.2059485Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.2059643Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.2059768Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.2059886Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.2059967Z ) 2025-05-07T20:33:01.2060211Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.2060304Z def test_silu_mul_quant( 2025-05-07T20:33:01.2060389Z self, 2025-05-07T20:33:01.2060467Z T: int, 2025-05-07T20:33:01.2060546Z D: int, 2025-05-07T20:33:01.2060652Z scale_ub: Optional[float], 2025-05-07T20:33:01.2060744Z contiguous: bool, 2025-05-07T20:33:01.2060831Z compiled: bool, 2025-05-07T20:33:01.2060922Z ) -> None: 2025-05-07T20:33:01.2061018Z torch.manual_seed(2025) 2025-05-07T20:33:01.2061092Z 2025-05-07T20:33:01.2061267Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.2063214Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.2063295Z 2025-05-07T20:33:01.2063416Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:01.2063420Z 2025-05-07T20:33:01.2063521Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.2063748Z self=, 2025-05-07T20:33:01.2063870Z T=16384, 2025-05-07T20:33:01.2063948Z D=7168, 2025-05-07T20:33:01.2064039Z scale_ub=None, 2025-05-07T20:33:01.2064123Z contiguous=True, 2025-05-07T20:33:01.2064212Z compiled=False, 2025-05-07T20:33:01.2064295Z ) 2025-05-07T20:33:01.2064515Z self = 2025-05-07T20:33:01.2064696Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:01.2064701Z 2025-05-07T20:33:01.2064778Z @given( 2025-05-07T20:33:01.2064896Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.2064998Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.2065108Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.2065226Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.2065351Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.2065429Z ) 2025-05-07T20:33:01.2065670Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.2065771Z def test_silu_mul_quant( 2025-05-07T20:33:01.2065847Z self, 2025-05-07T20:33:01.2065927Z T: int, 2025-05-07T20:33:01.2066004Z D: int, 2025-05-07T20:33:01.2066110Z scale_ub: Optional[float], 2025-05-07T20:33:01.2066204Z contiguous: bool, 2025-05-07T20:33:01.2066289Z compiled: bool, 2025-05-07T20:33:01.2066367Z ) -> None: 2025-05-07T20:33:01.2066467Z torch.manual_seed(2025) 2025-05-07T20:33:01.2066542Z 2025-05-07T20:33:01.2066714Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.2068618Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.2068630Z 2025-05-07T20:33:01.2068750Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:01.2068755Z 2025-05-07T20:33:01.2068861Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.2069081Z self=, 2025-05-07T20:33:01.2069167Z T=16384, 2025-05-07T20:33:01.2069247Z D=7168, 2025-05-07T20:33:01.2069329Z scale_ub=1200.0, 2025-05-07T20:33:01.2069423Z contiguous=True, 2025-05-07T20:33:01.2069507Z compiled=False, 2025-05-07T20:33:01.2069581Z ) 2025-05-07T20:33:01.2069802Z self = 2025-05-07T20:33:01.2069976Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:01.2069983Z 2025-05-07T20:33:01.2070068Z @given( 2025-05-07T20:33:01.2070190Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.2070290Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.2070453Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.2070650Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.2070760Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.2070839Z ) 2025-05-07T20:33:01.2071089Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.2071180Z def test_silu_mul_quant( 2025-05-07T20:33:01.2071260Z self, 2025-05-07T20:33:01.2071338Z T: int, 2025-05-07T20:33:01.2071412Z D: int, 2025-05-07T20:33:01.2071514Z scale_ub: Optional[float], 2025-05-07T20:33:01.2071603Z contiguous: bool, 2025-05-07T20:33:01.2071687Z compiled: bool, 2025-05-07T20:33:01.2071772Z ) -> None: 2025-05-07T20:33:01.2071929Z torch.manual_seed(2025) 2025-05-07T20:33:01.2072008Z 2025-05-07T20:33:01.2072203Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.2074115Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
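Every OutOfMemoryError above follows the same pattern: the very first allocation of a new Hypothesis example, a [T, 2*D] bf16 tensor (for T=16384, D=7168 that is 16384 x 14336 x 2 bytes, i.e. the 448.00 MiB in the message), fails because the 22.07 GiB A10G is already almost entirely occupied, with 21.73 GiB still allocated by PyTorch. The allocator's own suggestion in the message is PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True; a minimal sketch of wiring that in (a hypothetical conftest.py, not a file from this repo) would be:

import os

# PYTORCH_CUDA_ALLOC_CONF is read when the CUDA caching allocator initializes,
# so it must be set before the first tensor is placed on the GPU.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

Note this only mitigates fragmentation of reserved-but-unallocated memory (19.12 MiB here); it cannot help when 21.73 GiB is genuinely live.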
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.2074135Z 2025-05-07T20:33:01.2074293Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:01.2074300Z 2025-05-07T20:33:01.2074434Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.2074731Z self=, 2025-05-07T20:33:01.2074838Z T=128, 2025-05-07T20:33:01.2074943Z D=5120, 2025-05-07T20:33:01.2075073Z scale_ub=1200.0, 2025-05-07T20:33:01.2075172Z contiguous=False, 2025-05-07T20:33:01.2075261Z compiled=False, 2025-05-07T20:33:01.2075344Z ) 2025-05-07T20:33:01.2075558Z self = 2025-05-07T20:33:01.2075736Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:01.2075740Z 2025-05-07T20:33:01.2075817Z @given( 2025-05-07T20:33:01.2075938Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.2076040Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.2076152Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.2076267Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.2076386Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.2076463Z ) 2025-05-07T20:33:01.2076704Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.2076806Z def test_silu_mul_quant( 2025-05-07T20:33:01.2076955Z self, 2025-05-07T20:33:01.2077047Z T: int, 2025-05-07T20:33:01.2077128Z D: int, 2025-05-07T20:33:01.2077224Z scale_ub: Optional[float], 2025-05-07T20:33:01.2077320Z contiguous: bool, 2025-05-07T20:33:01.2077405Z compiled: bool, 2025-05-07T20:33:01.2077480Z ) -> None: 2025-05-07T20:33:01.2077583Z torch.manual_seed(2025) 2025-05-07T20:33:01.2077655Z 2025-05-07T20:33:01.2077819Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.2077900Z 2025-05-07T20:33:01.2077995Z x_sign = torch.sign(x) 2025-05-07T20:33:01.2078117Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.2078211Z x = x_sign * x_clamp 2025-05-07T20:33:01.2078291Z x0 = x[:, :D] 2025-05-07T20:33:01.2078374Z x1 = x[:, D:] 2025-05-07T20:33:01.2078454Z 2025-05-07T20:33:01.2078533Z if contiguous: 2025-05-07T20:33:01.2078633Z x0 = x0.contiguous() 2025-05-07T20:33:01.2078727Z x1 = x1.contiguous() 2025-05-07T20:33:01.2078890Z 2025-05-07T20:33:01.2078985Z if scale_ub is not None: 2025-05-07T20:33:01.2079090Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.2079224Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.2079304Z ) 2025-05-07T20:33:01.2079381Z else: 2025-05-07T20:33:01.2079473Z scale_ub_tensor = None 2025-05-07T20:33:01.2079550Z 2025-05-07T20:33:01.2079675Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.2079763Z op = silu_mul_quant 2025-05-07T20:33:01.2079853Z if compiled: 2025-05-07T20:33:01.2079950Z op = torch.compile(op) 2025-05-07T20:33:01.2080059Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.2080174Z 2025-05-07T20:33:01.2080263Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.2080267Z 2025-05-07T20:33:01.2080370Z moe/activation_test.py:117: 2025-05-07T20:33:01.2080499Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.2080601Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.2080702Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.2081203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.2081301Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.2081663Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.2081883Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.2082229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.2082322Z kernel = self.compile( 2025-05-07T20:33:01.2082724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.2082909Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.2083033Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.2083038Z 2025-05-07T20:33:01.2083243Z self = 2025-05-07T20:33:01.2084012Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.2084509Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d72fcae0>} 2025-05-07T20:33:01.2085309Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.2085503Z context = 2025-05-07T20:33:01.2085508Z 2025-05-07T20:33:01.2085673Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.2085935Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.2086040Z module_map=module_map) 2025-05-07T20:33:01.2086205Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.2086302Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.2086383Z E ^ 2025-05-07T20:33:01.2086735Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.2086743Z 2025-05-07T20:33:01.2087159Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.2087164Z 2025-05-07T20:33:01.2087348Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.2087567Z self=, 2025-05-07T20:33:01.2087647Z T=2048, 2025-05-07T20:33:01.2087721Z D=7168, 2025-05-07T20:33:01.2087802Z scale_ub=None, 2025-05-07T20:33:01.2087897Z contiguous=False, 2025-05-07T20:33:01.2087981Z compiled=False, 2025-05-07T20:33:01.2088053Z ) 2025-05-07T20:33:01.2088273Z self = 2025-05-07T20:33:01.2088446Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:01.2088451Z 2025-05-07T20:33:01.2088527Z @given( 2025-05-07T20:33:01.2088650Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.2088885Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.2088999Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.2089118Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.2089235Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.2089328Z ) 2025-05-07T20:33:01.2089568Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.2089657Z def test_silu_mul_quant( 2025-05-07T20:33:01.2089747Z self, 2025-05-07T20:33:01.2089823Z T: int, 2025-05-07T20:33:01.2089900Z D: int, 2025-05-07T20:33:01.2090004Z scale_ub: Optional[float], 2025-05-07T20:33:01.2090092Z contiguous: bool, 2025-05-07T20:33:01.2090175Z compiled: bool, 2025-05-07T20:33:01.2090258Z ) -> None: 2025-05-07T20:33:01.2090349Z torch.manual_seed(2025) 2025-05-07T20:33:01.2090426Z 2025-05-07T20:33:01.2090590Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.2092366Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
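This CompilationError is different in kind from the OOMs: Triton refuses to lower the fp8e4nv (FP8 E4M3) dtype on this GPU and only offers 'fp8e4b15' and 'fp8e5'. The g5.4xlarge runner carries an NVIDIA A10G (compute capability 8.6), while Triton's fp8e4nv kernels generally require capability 8.9+ (Ada/Hopper). A sketch of gating such tests on device capability follows; the helper name and the (8, 9) cutoff are assumptions based on the error above, not code from activation_test.py:

import unittest
import torch

def _has_fp8e4nv() -> bool:
    # Assumed cutoff: Triton exposes fp8e4nv from compute capability 8.9
    # (Ada/Hopper); the A10G here is 8.6, hence the ValueError in this log.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Illustrative placement only; the real test class lives in activation_test.py.
@unittest.skipIf(not _has_fp8e4nv(), "fp8e4nv not supported on this architecture")
class ActivationTests(unittest.TestCase):
    ...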
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.2092380Z 2025-05-07T20:33:01.2092494Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:01.2092499Z 2025-05-07T20:33:01.2092598Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.2092825Z self=, 2025-05-07T20:33:01.2092901Z T=128, 2025-05-07T20:33:01.2092982Z D=7168, 2025-05-07T20:33:01.2093069Z scale_ub=1200.0, 2025-05-07T20:33:01.2093152Z contiguous=True, 2025-05-07T20:33:01.2093234Z compiled=True, 2025-05-07T20:33:01.2093358Z ) 2025-05-07T20:33:01.2093576Z self = 2025-05-07T20:33:01.2093749Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:01.2093753Z 2025-05-07T20:33:01.2093831Z @given( 2025-05-07T20:33:01.2093947Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.2094052Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.2094164Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.2094281Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.2094398Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.2094472Z ) 2025-05-07T20:33:01.2094713Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.2094814Z def test_silu_mul_quant( 2025-05-07T20:33:01.2094891Z self, 2025-05-07T20:33:01.2094976Z T: int, 2025-05-07T20:33:01.2095052Z D: int, 2025-05-07T20:33:01.2095149Z scale_ub: Optional[float], 2025-05-07T20:33:01.2095326Z contiguous: bool, 2025-05-07T20:33:01.2095412Z compiled: bool, 2025-05-07T20:33:01.2095495Z ) -> None: 2025-05-07T20:33:01.2095593Z torch.manual_seed(2025) 2025-05-07T20:33:01.2095665Z 2025-05-07T20:33:01.2095828Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.2095907Z 2025-05-07T20:33:01.2095996Z x_sign = torch.sign(x) 2025-05-07T20:33:01.2096119Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.2096210Z x = x_sign * x_clamp 2025-05-07T20:33:01.2096290Z x0 = x[:, :D] 2025-05-07T20:33:01.2096374Z x1 = x[:, D:] 2025-05-07T20:33:01.2096445Z 2025-05-07T20:33:01.2096525Z if contiguous: 2025-05-07T20:33:01.2096661Z x0 = x0.contiguous() 2025-05-07T20:33:01.2096747Z x1 = x1.contiguous() 2025-05-07T20:33:01.2096818Z 2025-05-07T20:33:01.2096914Z if scale_ub is not None: 2025-05-07T20:33:01.2097022Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:01.2097162Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:01.2097245Z ) 2025-05-07T20:33:01.2097321Z else: 2025-05-07T20:33:01.2097412Z scale_ub_tensor = None 2025-05-07T20:33:01.2097490Z 2025-05-07T20:33:01.2097619Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:01.2097712Z op = silu_mul_quant 2025-05-07T20:33:01.2097793Z if compiled: 2025-05-07T20:33:01.2097890Z op = torch.compile(op) 2025-05-07T20:33:01.2097999Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.2098071Z 2025-05-07T20:33:01.2098163Z > y_fp8, y_scale = fn() 2025-05-07T20:33:01.2098171Z 2025-05-07T20:33:01.2098272Z moe/activation_test.py:117: 2025-05-07T20:33:01.2098400Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.2098501Z moe/activation_test.py:115: in fn 2025-05-07T20:33:01.2098611Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:01.2098975Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:01.2099072Z return fn(*args, **kwargs) 
2025-05-07T20:33:01.2099562Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:01.2099657Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:01.2100017Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:01.2100237Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:01.2100579Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:01.2100676Z kernel = self.compile( 2025-05-07T20:33:01.2101129Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:01.2101314Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:01.2101439Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:01.2101443Z 2025-05-07T20:33:01.2101643Z self = 2025-05-07T20:33:01.2102417Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:01.2102912Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f35d7180040>} 2025-05-07T20:33:01.2103860Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:01.2104163Z context = 2025-05-07T20:33:01.2104171Z 2025-05-07T20:33:01.2104359Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:01.2104628Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:01.2104734Z module_map=module_map) 2025-05-07T20:33:01.2104898Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:01.2104995Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:01.2105071Z E ^ 2025-05-07T20:33:01.2105431Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:01.2105484Z 2025-05-07T20:33:01.2105905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:01.2105914Z 2025-05-07T20:33:01.2106022Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.2106240Z self=, 2025-05-07T20:33:01.2106317Z T=128, 2025-05-07T20:33:01.2106401Z D=7168, 2025-05-07T20:33:01.2106481Z scale_ub=1200.0, 2025-05-07T20:33:01.2106566Z contiguous=True, 2025-05-07T20:33:01.2106653Z compiled=False, 2025-05-07T20:33:01.2106726Z ) 2025-05-07T20:33:01.2106942Z self = 2025-05-07T20:33:01.2107117Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:01.2107122Z 2025-05-07T20:33:01.2107199Z @given( 2025-05-07T20:33:01.2107325Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.2107534Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.2107660Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.2107788Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.2107900Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.2107970Z ) 2025-05-07T20:33:01.2108220Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.2108309Z def test_silu_mul_quant( 2025-05-07T20:33:01.2108382Z self, 2025-05-07T20:33:01.2108466Z T: int, 2025-05-07T20:33:01.2108543Z D: int, 2025-05-07T20:33:01.2108646Z scale_ub: Optional[float], 2025-05-07T20:33:01.2108733Z contiguous: bool, 2025-05-07T20:33:01.2108817Z compiled: bool, 2025-05-07T20:33:01.2108901Z ) -> None: 2025-05-07T20:33:01.2108993Z torch.manual_seed(2025) 2025-05-07T20:33:01.2109066Z 2025-05-07T20:33:01.2109238Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.2109309Z 2025-05-07T20:33:01.2109458Z x_sign = torch.sign(x) 2025-05-07T20:33:01.2109594Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.2111354Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.2111360Z 2025-05-07T20:33:01.2111483Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:01.2111490Z 2025-05-07T20:33:01.2111590Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.2111816Z self=, 2025-05-07T20:33:01.2111895Z T=128, 2025-05-07T20:33:01.2112015Z D=5120, 2025-05-07T20:33:01.2112137Z scale_ub=1200.0, 2025-05-07T20:33:01.2112223Z contiguous=True, 2025-05-07T20:33:01.2112302Z compiled=True, 2025-05-07T20:33:01.2112378Z ) 2025-05-07T20:33:01.2112592Z self = 2025-05-07T20:33:01.2112755Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:01.2112759Z 2025-05-07T20:33:01.2112839Z @given( 2025-05-07T20:33:01.2112953Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.2113053Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.2113164Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.2113277Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.2113437Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.2113511Z ) 2025-05-07T20:33:01.2113754Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.2113854Z def test_silu_mul_quant( 2025-05-07T20:33:01.2113929Z self, 2025-05-07T20:33:01.2114003Z T: int, 2025-05-07T20:33:01.2114084Z D: int, 2025-05-07T20:33:01.2114180Z scale_ub: Optional[float], 2025-05-07T20:33:01.2114267Z contiguous: bool, 2025-05-07T20:33:01.2114356Z compiled: bool, 2025-05-07T20:33:01.2114431Z ) -> None: 2025-05-07T20:33:01.2114527Z torch.manual_seed(2025) 2025-05-07T20:33:01.2114596Z 2025-05-07T20:33:01.2114758Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.2114845Z 2025-05-07T20:33:01.2114934Z x_sign = torch.sign(x) 2025-05-07T20:33:01.2115057Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:01.2116817Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
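By this point the free pool has shrunk from 26.44 MiB to 4.44 MiB: tensors cached from earlier examples are still held when the next example draws its input, so even the 20.00 MiB temporary for torch.clamp fails. Assuming unittest setUp/tearDown run around each generated Hypothesis example, one possible per-example cleanup looks like the sketch below (illustrative only, not code from the test file):

import gc
import unittest
import torch

class ActivationTests(unittest.TestCase):  # illustrative placement only
    def tearDown(self) -> None:
        gc.collect()              # drop dead Python references to tensors first
        torch.cuda.empty_cache()  # then return cached CUDA blocks to the driver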
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.2116825Z 2025-05-07T20:33:01.2116940Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:01.2116944Z 2025-05-07T20:33:01.2117046Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:01.2117266Z self=, 2025-05-07T20:33:01.2117342Z T=128, 2025-05-07T20:33:01.2117421Z D=7168, 2025-05-07T20:33:01.2117504Z scale_ub=None, 2025-05-07T20:33:01.2117594Z contiguous=True, 2025-05-07T20:33:01.2117676Z compiled=True, 2025-05-07T20:33:01.2117745Z ) 2025-05-07T20:33:01.2118008Z self = 2025-05-07T20:33:01.2118174Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:01.2118179Z 2025-05-07T20:33:01.2118250Z @given( 2025-05-07T20:33:01.2118369Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:01.2118468Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:01.2118584Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:01.2118704Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:01.2118816Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:01.2118890Z ) 2025-05-07T20:33:01.2119131Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:01.2119219Z def test_silu_mul_quant( 2025-05-07T20:33:01.2119302Z self, 2025-05-07T20:33:01.2119374Z T: int, 2025-05-07T20:33:01.2119451Z D: int, 2025-05-07T20:33:01.2119561Z scale_ub: Optional[float], 2025-05-07T20:33:01.2119646Z contiguous: bool, 2025-05-07T20:33:01.2119810Z compiled: bool, 2025-05-07T20:33:01.2119896Z ) -> None: 2025-05-07T20:33:01.2119992Z torch.manual_seed(2025) 2025-05-07T20:33:01.2120065Z 2025-05-07T20:33:01.2120234Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:01.2122047Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:01.2122091Z 2025-05-07T20:33:01.2122217Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:01.2122353Z =============================== warnings summary =============================== 2025-05-07T20:33:01.2122665Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:01.2122963Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:01.2123256Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:01.2124133Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:33:01.2124366Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:33:01.2124371Z 2025-05-07T20:33:01.2124586Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:33:01.2124759Z ================= 1 failed, 1 deselected, 3 warnings in 13.91s ================= 2025-05-07T20:33:02.9396373Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:33:03.0018332Z [EXEC] [ATTEMPT 0/2] Command attempt failed. 2025-05-07T20:33:03.0018737Z 2025-05-07T20:33:05.0034444Z [EXEC] [ATTEMPT 1/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:33:07.1654302Z ============================= test session starts ============================== 2025-05-07T20:33:07.1654975Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:33:07.1655758Z cachedir: .pytest_cache 2025-05-07T20:33:07.1656344Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:33:07.1657072Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:33:07.1657475Z plugins: hypothesis-6.131.14 2025-05-07T20:33:08.7299880Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:33:08.8269444Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:33:08.8269851Z run-last-failure: rerun previous 1 failure 2025-05-07T20:33:08.8270065Z 2025-05-07T20:33:10.9537239Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.9537951Z self=, 2025-05-07T20:33:10.9538396Z T=1, 2025-05-07T20:33:10.9538618Z D=5120, 2025-05-07T20:33:10.9538900Z scale_ub=None, 2025-05-07T20:33:10.9539188Z contiguous=True, 2025-05-07T20:33:10.9539885Z compiled=True, 2025-05-07T20:33:10.9540351Z ) 2025-05-07T20:33:10.9540691Z self = 2025-05-07T20:33:10.9541183Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:10.9541447Z 2025-05-07T20:33:10.9541523Z @given( 2025-05-07T20:33:10.9541754Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.9542062Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.9542354Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.9542682Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.9543002Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.9543396Z ) 2025-05-07T20:33:10.9543740Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.9544223Z def test_silu_mul_quant( 2025-05-07T20:33:10.9544485Z self, 2025-05-07T20:33:10.9544668Z T: int, 2025-05-07T20:33:10.9551750Z D: int, 2025-05-07T20:33:10.9551982Z scale_ub: Optional[float], 2025-05-07T20:33:10.9552248Z contiguous: bool, 2025-05-07T20:33:10.9552493Z compiled: bool, 2025-05-07T20:33:10.9552723Z ) -> None: 2025-05-07T20:33:10.9552931Z torch.manual_seed(2025) 2025-05-07T20:33:10.9553173Z 2025-05-07T20:33:10.9553444Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.9553792Z 2025-05-07T20:33:10.9553994Z x_sign = torch.sign(x) 2025-05-07T20:33:10.9554314Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:33:10.9554625Z x = x_sign * x_clamp 2025-05-07T20:33:10.9554855Z x0 = x[:, :D] 2025-05-07T20:33:10.9555070Z x1 = x[:, D:] 2025-05-07T20:33:10.9555278Z 2025-05-07T20:33:10.9555457Z if contiguous: 2025-05-07T20:33:10.9555689Z x0 = x0.contiguous() 2025-05-07T20:33:10.9555943Z x1 = x1.contiguous() 2025-05-07T20:33:10.9556174Z 2025-05-07T20:33:10.9556357Z if scale_ub is not None: 2025-05-07T20:33:10.9556624Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.9556954Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.9557259Z ) 2025-05-07T20:33:10.9557449Z else: 2025-05-07T20:33:10.9557650Z scale_ub_tensor = None 2025-05-07T20:33:10.9557894Z 2025-05-07T20:33:10.9558123Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.9558426Z op = silu_mul_quant 2025-05-07T20:33:10.9558663Z if compiled: 2025-05-07T20:33:10.9558903Z op = torch.compile(op) 2025-05-07T20:33:10.9559195Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.9559459Z 2025-05-07T20:33:10.9559647Z y_fp8, y_scale = fn() 2025-05-07T20:33:10.9559925Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:10.9560331Z 2025-05-07T20:33:10.9560576Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.9560899Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:10.9561178Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:10.9561482Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:10.9561837Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:10.9562141Z 2025-05-07T20:33:10.9562329Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:10.9562525Z 2025-05-07T20:33:10.9562623Z moe/activation_test.py:126: 2025-05-07T20:33:10.9562914Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.9563237Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:10.9563564Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:10.9564359Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:10.9565254Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:10.9565798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.9566464Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.9567158Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:10.9567862Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:10.9568585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:10.9569212Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:10.9570785Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:10.9571296Z fn() 2025-05-07T20:33:10.9571812Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:10.9572404Z self.fn.run( 2025-05-07T20:33:10.9572862Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.9573382Z kernel = self.compile( 2025-05-07T20:33:10.9573927Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.9574598Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.9574982Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.9575212Z 2025-05-07T20:33:10.9575418Z self = 2025-05-07T20:33:10.9576487Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.9577924Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38ffeae700>} 2025-05-07T20:33:10.9579281Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.9580330Z context = 2025-05-07T20:33:10.9580609Z 2025-05-07T20:33:10.9580771Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.9581286Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.9581800Z module_map=module_map) 2025-05-07T20:33:10.9582163Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.9582515Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:10.9582777Z E ^ 2025-05-07T20:33:10.9583228Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.9583691Z 2025-05-07T20:33:10.9584115Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:10.9584622Z 2025-05-07T20:33:10.9584721Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:10.9585125Z self=, 2025-05-07T20:33:10.9585527Z T=2048, 2025-05-07T20:33:10.9585710Z D=5120, 2025-05-07T20:33:10.9585901Z scale_ub=1200.0, 2025-05-07T20:33:10.9586119Z contiguous=True, 2025-05-07T20:33:10.9586327Z compiled=False, 2025-05-07T20:33:10.9586525Z ) 2025-05-07T20:33:10.9586844Z self = 2025-05-07T20:33:10.9587518Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:10.9587792Z 2025-05-07T20:33:10.9587866Z @given( 2025-05-07T20:33:10.9588088Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:10.9588386Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:10.9588687Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:10.9589006Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:10.9589318Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:10.9589595Z ) 2025-05-07T20:33:10.9589948Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:10.9590426Z def test_silu_mul_quant( 2025-05-07T20:33:10.9590663Z self, 2025-05-07T20:33:10.9590855Z T: int, 2025-05-07T20:33:10.9591046Z D: int, 2025-05-07T20:33:10.9591257Z scale_ub: Optional[float], 2025-05-07T20:33:10.9591526Z contiguous: bool, 2025-05-07T20:33:10.9591762Z compiled: bool, 2025-05-07T20:33:10.9591972Z ) -> None: 2025-05-07T20:33:10.9592180Z torch.manual_seed(2025) 2025-05-07T20:33:10.9592421Z 2025-05-07T20:33:10.9592682Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:10.9593015Z 2025-05-07T20:33:10.9593192Z x_sign = torch.sign(x) 2025-05-07T20:33:10.9593468Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:10.9593769Z x = x_sign * x_clamp 2025-05-07T20:33:10.9594005Z x0 = x[:, :D] 
2025-05-07T20:33:10.9594210Z x1 = x[:, D:] 2025-05-07T20:33:10.9594413Z 2025-05-07T20:33:10.9594594Z if contiguous: 2025-05-07T20:33:10.9594815Z x0 = x0.contiguous() 2025-05-07T20:33:10.9595072Z x1 = x1.contiguous() 2025-05-07T20:33:10.9595310Z 2025-05-07T20:33:10.9595490Z if scale_ub is not None: 2025-05-07T20:33:10.9595751Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:10.9596075Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:10.9596376Z ) 2025-05-07T20:33:10.9596552Z else: 2025-05-07T20:33:10.9596748Z scale_ub_tensor = None 2025-05-07T20:33:10.9596990Z 2025-05-07T20:33:10.9597208Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:10.9597509Z op = silu_mul_quant 2025-05-07T20:33:10.9597750Z if compiled: 2025-05-07T20:33:10.9597985Z op = torch.compile(op) 2025-05-07T20:33:10.9598274Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.9598542Z 2025-05-07T20:33:10.9598725Z > y_fp8, y_scale = fn() 2025-05-07T20:33:10.9598890Z 2025-05-07T20:33:10.9598987Z moe/activation_test.py:117: 2025-05-07T20:33:10.9599281Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.9599659Z moe/activation_test.py:115: in fn 2025-05-07T20:33:10.9599930Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:10.9600615Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:10.9601293Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:10.9601836Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:10.9602503Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:10.9603155Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:10.9603674Z kernel = self.compile( 2025-05-07T20:33:10.9604219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:10.9604885Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:10.9605377Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:10.9605640Z 2025-05-07T20:33:10.9605848Z self = 2025-05-07T20:33:10.9606904Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:10.9608251Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38ffd5e020>} 2025-05-07T20:33:10.9609568Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:10.9610668Z context = 2025-05-07T20:33:10.9610952Z 2025-05-07T20:33:10.9611116Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:10.9611630Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:10.9612104Z module_map=module_map) 2025-05-07T20:33:10.9612462Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:10.9612801Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:10.9613052Z E ^ 2025-05-07T20:33:10.9613516Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:10.9613960Z 2025-05-07T20:33:10.9614431Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.6219381Z 2025-05-07T20:33:11.6220250Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.6221146Z self=, 2025-05-07T20:33:11.6221968Z T=2048, 2025-05-07T20:33:11.6222326Z D=5120, 2025-05-07T20:33:11.6222689Z scale_ub=1200.0, 2025-05-07T20:33:11.6223107Z contiguous=True, 2025-05-07T20:33:11.6223531Z compiled=True, 2025-05-07T20:33:11.6223922Z ) 2025-05-07T20:33:11.6224359Z self = 2025-05-07T20:33:11.6224882Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:11.6225163Z 2025-05-07T20:33:11.6225244Z @given( 2025-05-07T20:33:11.6225467Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.6225769Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.6226074Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.6226404Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.6226719Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.6227287Z ) 2025-05-07T20:33:11.6227692Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.6228131Z def test_silu_mul_quant( 2025-05-07T20:33:11.6228372Z self, 2025-05-07T20:33:11.6228602Z T: int, 2025-05-07T20:33:11.6228796Z D: int, 2025-05-07T20:33:11.6229014Z scale_ub: Optional[float], 2025-05-07T20:33:11.6229269Z contiguous: bool, 2025-05-07T20:33:11.6229501Z compiled: bool, 2025-05-07T20:33:11.6229738Z ) -> None: 2025-05-07T20:33:11.6229945Z torch.manual_seed(2025) 2025-05-07T20:33:11.6230175Z 2025-05-07T20:33:11.6230443Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.6230786Z 2025-05-07T20:33:11.6230969Z x_sign = torch.sign(x) 2025-05-07T20:33:11.6231261Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.6231572Z x = x_sign * x_clamp 2025-05-07T20:33:11.6231797Z x0 = x[:, :D] 2025-05-07T20:33:11.6232012Z x1 = x[:, D:] 2025-05-07T20:33:11.6232302Z 2025-05-07T20:33:11.6232554Z if contiguous: 2025-05-07T20:33:11.6232778Z x0 = x0.contiguous() 2025-05-07T20:33:11.6233033Z x1 = x1.contiguous() 2025-05-07T20:33:11.6233269Z 2025-05-07T20:33:11.6233452Z if scale_ub is not None: 2025-05-07T20:33:11.6233722Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.6234050Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.6234358Z ) 2025-05-07T20:33:11.6234550Z else: 2025-05-07T20:33:11.6234755Z scale_ub_tensor = None 2025-05-07T20:33:11.6234992Z 2025-05-07T20:33:11.6235216Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.6235517Z op = silu_mul_quant 2025-05-07T20:33:11.6235844Z if compiled: 2025-05-07T20:33:11.6236084Z op = torch.compile(op) 2025-05-07T20:33:11.6236378Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.6236640Z 2025-05-07T20:33:11.6236828Z y_fp8, y_scale = fn() 2025-05-07T20:33:11.6237102Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:11.6237383Z 2025-05-07T20:33:11.6237606Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.6237930Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:11.6238216Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:11.6238518Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:11.6238867Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:11.6239168Z 2025-05-07T20:33:11.6239355Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:11.6239552Z 2025-05-07T20:33:11.6239649Z moe/activation_test.py:126: 2025-05-07T20:33:11.6239943Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.6240534Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:11.6240857Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:11.6241678Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:11.6242415Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:11.6242961Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.6243654Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.6244395Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:11.6245107Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:11.6245823Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:11.6246536Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:11.6247149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:11.6247657Z fn() 2025-05-07T20:33:11.6248156Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:11.6248732Z self.fn.run( 2025-05-07T20:33:11.6249195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.6249709Z kernel = self.compile( 2025-05-07T20:33:11.6250247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.6250888Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.6251276Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.6251503Z 2025-05-07T20:33:11.6251715Z self = 2025-05-07T20:33:11.6252968Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.6254389Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38fec3e200>} 2025-05-07T20:33:11.6255703Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.6256761Z context = 2025-05-07T20:33:11.6257114Z 2025-05-07T20:33:11.6257277Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.6257805Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.6258267Z module_map=module_map) 2025-05-07T20:33:11.6258623Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.6258975Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:11.6259231Z E ^ 2025-05-07T20:33:11.6259676Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:11.6260128Z 2025-05-07T20:33:11.6260559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:11.6261067Z 2025-05-07T20:33:11.6261168Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:11.6261578Z self=, 2025-05-07T20:33:11.6261974Z T=16384, 2025-05-07T20:33:11.6262159Z D=7168, 2025-05-07T20:33:11.6262345Z scale_ub=1200.0, 2025-05-07T20:33:11.6262563Z contiguous=False, 2025-05-07T20:33:11.6262786Z compiled=False, 2025-05-07T20:33:11.6262983Z ) 2025-05-07T20:33:11.6263284Z self = 2025-05-07T20:33:11.6263775Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:11.6264062Z 2025-05-07T20:33:11.6264135Z @given( 2025-05-07T20:33:11.6264358Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:11.6264658Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:11.6264956Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:11.6265278Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:11.6265591Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:11.6265870Z ) 2025-05-07T20:33:11.6266217Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:11.6266696Z def test_silu_mul_quant( 2025-05-07T20:33:11.6266939Z self, 2025-05-07T20:33:11.6267124Z T: int, 2025-05-07T20:33:11.6267306Z D: int, 2025-05-07T20:33:11.6267572Z scale_ub: Optional[float], 2025-05-07T20:33:11.6267841Z contiguous: bool, 2025-05-07T20:33:11.6268074Z compiled: bool, 2025-05-07T20:33:11.6268284Z ) -> None: 2025-05-07T20:33:11.6268495Z torch.manual_seed(2025) 2025-05-07T20:33:11.6268732Z 2025-05-07T20:33:11.6268992Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:11.6269329Z 2025-05-07T20:33:11.6269516Z x_sign = torch.sign(x) 2025-05-07T20:33:11.6269797Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:11.6270098Z x = x_sign * x_clamp 2025-05-07T20:33:11.6270336Z x0 = x[:, :D] 2025-05-07T20:33:11.6270540Z x1 = x[:, D:] 2025-05-07T20:33:11.6270743Z 2025-05-07T20:33:11.6270920Z if contiguous: 2025-05-07T20:33:11.6271139Z x0 = x0.contiguous() 2025-05-07T20:33:11.6271479Z x1 = x1.contiguous() 2025-05-07T20:33:11.6271716Z 2025-05-07T20:33:11.6271900Z if scale_ub is not None: 2025-05-07T20:33:11.6272170Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:11.6272503Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:11.6272808Z ) 2025-05-07T20:33:11.6272988Z else: 2025-05-07T20:33:11.6273188Z scale_ub_tensor = None 2025-05-07T20:33:11.6273434Z 2025-05-07T20:33:11.6273652Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:11.6273961Z op = silu_mul_quant 2025-05-07T20:33:11.6274203Z if compiled: 2025-05-07T20:33:11.6274438Z op = torch.compile(op) 2025-05-07T20:33:11.6274774Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.6275036Z 2025-05-07T20:33:11.6275214Z > y_fp8, y_scale = fn() 2025-05-07T20:33:11.6275388Z 2025-05-07T20:33:11.6275484Z moe/activation_test.py:117: 2025-05-07T20:33:11.6275783Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.6276095Z moe/activation_test.py:115: in fn 2025-05-07T20:33:11.6276370Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:11.6277046Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:11.6277721Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:11.6278246Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:11.6278912Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:11.6279566Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:11.6280091Z kernel = self.compile( 2025-05-07T20:33:11.6280631Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:11.6281275Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:11.6281666Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:11.6281886Z 2025-05-07T20:33:11.6282091Z self = 2025-05-07T20:33:11.6283149Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:11.6284550Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38fee484a0>} 2025-05-07T20:33:11.6285922Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:11.6286988Z context = 2025-05-07T20:33:11.6287268Z 2025-05-07T20:33:11.6287430Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:11.6287947Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:11.6288423Z module_map=module_map) 2025-05-07T20:33:11.6288778Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:11.6289140Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:11.6289401Z E ^ 2025-05-07T20:33:11.6289863Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
self = 
T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38fef04ea0>}
module_map = {'triton.language.extra.libdevice': }
context = 

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
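Note on the root cause: fp8e4nv is Triton's name for the FP8 E4M3 format these FBGEMM kernels request. Triton only lowers it on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper); the linux.g5.4xlarge runner carries an A10G, which reports capability 8.6, so every launch of _fbgemm_silu_mul_quant and _kernel_quantize_fp8_row fails at kernel-compile time. A minimal probe one could use to detect this ahead of time (the helper name is ours, not part of fbgemm_gpu; the sm_89 cutoff is the assumed rule):

    import torch

    def supports_fp8e4nv() -> bool:
        # FP8 E4M3 ("fp8e4nv") lowering in Triton is assumed to need
        # compute capability >= 8.9 (Ada/Hopper).
        if not torch.cuda.is_available():
            return False
        # An A10G (g5 instance) reports (8, 6) and returns False here.
        return torch.cuda.get_device_capability() >= (8, 9)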
Every remaining example fails the same way: the first Triton kernel launch in the example hits the fp8e4nv CompilationError, either in fn() at moe/activation_test.py:117 (silu_mul_quant -> _fbgemm_silu_mul_quant, via fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) or in ref_fn() at moe/activation_test.py:126 (triton_quantize_fp8_row -> _kernel_quantize_fp8_row, via fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370). The test body and tracebacks are identical to the one above; only the drawn parameters and the failing call differ.

Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False)
    -> fn() -> _fbgemm_silu_mul_quant[grid](...)
    E   triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^
    E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False)
    -> fn() -> _fbgemm_silu_mul_quant[grid](...): same CompilationError (fp8e4nv not supported).
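For readers following the math rather than the stack traces: each example computes y = silu(x0) * x1 = x0 * sigmoid(x0) * x1 in fp32 and then row-wise fp8-quantizes it, with dequantization defined in the test as y_fp8.to(torch.float32) * y_scale[:, None]. A torch-only sketch of that quantization step, inferred from the test body (the exact rounding and saturation behavior of triton_quantize_fp8_row is an assumption here):

    from typing import Optional, Tuple
    import torch

    def rowwise_fp8_quant_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor]
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Per-row multiplicative dequant scale: y ~= y_fp8.float() * y_scale[:, None]
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=1).clamp(min=1e-12)
        if scale_ub is not None:
            # Optional upper bound on the row maximum, as in the scale_ub=1200.0 examples.
            row_max = torch.minimum(row_max, scale_ub)
        y_scale = row_max / fp8_max
        y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, y_scale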
Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=True)
    -> ref_fn() -> _kernel_quantize_fp8_row[grid](...): same CompilationError (fp8e4nv not supported).

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=False, compiled=False)
    -> fn() -> _fbgemm_silu_mul_quant[grid](...): same CompilationError (fp8e4nv not supported).
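On the Hypothesis side: @given with st.sampled_from draws examples from the 5 x 2 x 2 x 2 x 2 = 80-point grid of (T, D, scale_ub, contiguous, compiled); verbosity=Verbosity.verbose is what prints each "Trying example" block, max_examples=_MAX_SAMPLES caps the number of draws, and deadline=None disables the per-example time limit (relevant when the first call triggers a Triton compile). A self-contained sketch of the same pattern; _MAX_SAMPLES is internal to the test module, so a literal stands in:

    from hypothesis import Verbosity, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=10, deadline=None)
    def check_grid(T: int, D: int) -> None:
        # A property that holds at every point of this sampled grid.
        assert T >= 1 and D % 256 == 0

    check_grid()  # calling the decorated function runs the property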
Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
    -> fn() -> _fbgemm_silu_mul_quant[grid](...): same CompilationError (fp8e4nv not supported).

Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> ref_fn() -> _kernel_quantize_fp8_row[grid](...): same CompilationError (fp8e4nv not supported).

Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> ref_fn() -> _kernel_quantize_fp8_row[grid](...): same CompilationError (fp8e4nv not supported).
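The alternation in the failing call site is worth noting: every compiled=False example dies inside fn() (the eager launch of _fbgemm_silu_mul_quant), while every compiled=True example gets past fn() and dies in the eager reference ref_fn() instead, which suggests the torch.compile path does not launch the handwritten kernel the same way the eager path does. Either way, every example on this GPU ends in the same architecture error. A standalone reproduction outside Hypothesis, under the same assumptions as the test body (import path and signature taken from the traceback):

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    T, D = 128, 5120
    x0 = torch.randn([T, D], device="cuda", dtype=torch.bfloat16)
    x1 = torch.randn([T, D], device="cuda", dtype=torch.bfloat16)
    # On an sm_86 GPU this raises triton.compiler.errors.CompilationError
    # ("type fp8e4nv not supported in this architecture").
    y_fp8, y_scale = silu_mul_quant(x0, x1, None)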
Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> ref_fn() -> _kernel_quantize_fp8_row[grid](...): same CompilationError (fp8e4nv not supported).

Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
    -> ref_fn() -> _kernel_quantize_fp8_row[grid](...): same CompilationError (fp8e4nv not supported).
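Since the failure is a property of the runner's GPU rather than of the kernels' logic, one hedged way to keep this suite meaningful on pre-sm_89 machines would be a capability-based skip, sketched below with unittest (the class name is illustrative, and whether skipping is the right policy for this CI job is a separate decision):

    import unittest
    import torch

    def _has_fp8_gpu() -> bool:
        # Mirrors the capability probe above: fp8e4nv assumed to need sm_89+.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not _has_fp8_gpu(), "fp8e4nv requires sm_89+ (Ada/Hopper)")
    class SiluMulQuantFp8Test(unittest.TestCase):
        ...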
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:14.7919299Z 2025-05-07T20:33:14.7919726Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:14.7920225Z 2025-05-07T20:33:14.7920332Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:14.7920728Z self=, 2025-05-07T20:33:14.7921112Z T=16384, 2025-05-07T20:33:14.7921297Z D=5120, 2025-05-07T20:33:14.7921477Z scale_ub=None, 2025-05-07T20:33:14.7921729Z contiguous=True, 2025-05-07T20:33:14.7921942Z compiled=True, 2025-05-07T20:33:14.7922127Z ) 2025-05-07T20:33:14.7922435Z self = 2025-05-07T20:33:14.7922916Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:14.7923177Z 2025-05-07T20:33:14.7923248Z @given( 2025-05-07T20:33:14.7923470Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:14.7923771Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:14.7924063Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:14.7924373Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:14.7924688Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:14.7924957Z ) 2025-05-07T20:33:14.7925292Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:14.7925722Z def test_silu_mul_quant( 2025-05-07T20:33:14.7925950Z self, 2025-05-07T20:33:14.7926131Z T: int, 2025-05-07T20:33:14.7926317Z D: int, 2025-05-07T20:33:14.7926525Z scale_ub: Optional[float], 2025-05-07T20:33:14.7926787Z contiguous: bool, 2025-05-07T20:33:14.7927022Z compiled: bool, 2025-05-07T20:33:14.7927234Z ) -> None: 2025-05-07T20:33:14.7927437Z torch.manual_seed(2025) 2025-05-07T20:33:14.7927669Z 2025-05-07T20:33:14.7927936Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:14.7928268Z 2025-05-07T20:33:14.7928451Z x_sign = torch.sign(x) 2025-05-07T20:33:14.7928733Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:14.7929211Z x = x_sign * x_clamp 2025-05-07T20:33:14.7929441Z x0 = x[:, :D] 2025-05-07T20:33:14.7929647Z x1 = x[:, D:] 2025-05-07T20:33:14.7929851Z 2025-05-07T20:33:14.7930024Z if contiguous: 2025-05-07T20:33:14.7930244Z x0 = x0.contiguous() 2025-05-07T20:33:14.7930491Z x1 = x1.contiguous() 2025-05-07T20:33:14.7930716Z 2025-05-07T20:33:14.7930899Z if scale_ub is not None: 2025-05-07T20:33:14.7931242Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:14.7931585Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:14.7931890Z ) 2025-05-07T20:33:14.7932082Z else: 2025-05-07T20:33:14.7932290Z scale_ub_tensor = None 2025-05-07T20:33:14.7932535Z 2025-05-07T20:33:14.7932761Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:14.7933061Z op = silu_mul_quant 2025-05-07T20:33:14.7933300Z if compiled: 2025-05-07T20:33:14.7933544Z op = torch.compile(op) 2025-05-07T20:33:14.7933862Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:14.7934245Z 2025-05-07T20:33:14.7934496Z y_fp8, y_scale = fn() 2025-05-07T20:33:14.7934888Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:14.7935243Z 2025-05-07T20:33:14.7935470Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:14.7935799Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:14.7936091Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:14.7936491Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:14.7936844Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:14.7937148Z 2025-05-07T20:33:14.7937337Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:14.7937530Z 2025-05-07T20:33:14.7937627Z moe/activation_test.py:126: 2025-05-07T20:33:14.7937920Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.7938249Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:14.7938561Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:14.7939336Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:14.7940282Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:14.7940820Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:14.7941494Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:14.7942172Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:14.7942878Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:14.7943613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:14.7944248Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:14.7944873Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:14.7945424Z fn() 2025-05-07T20:33:14.7945921Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:14.7946502Z self.fn.run( 2025-05-07T20:33:14.7946965Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:14.7947546Z kernel = self.compile( 2025-05-07T20:33:14.7948082Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:14.7948720Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:14.7949109Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:14.7949330Z 2025-05-07T20:33:14.7949535Z self = 2025-05-07T20:33:14.7950594Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:14.7952031Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d8d38680>} 2025-05-07T20:33:14.7953785Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:14.7954856Z context = 2025-05-07T20:33:14.7955138Z 2025-05-07T20:33:14.7955300Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:14.7955810Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:14.7956268Z module_map=module_map) 2025-05-07T20:33:14.7956627Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:14.7956980Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:14.7957241Z E ^ 2025-05-07T20:33:14.7957791Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:14.7958302Z 2025-05-07T20:33:14.7958719Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:14.8157923Z W0507 20:33:14.814000 89314 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8) 2025-05-07T20:33:14.8160582Z W0507 20:33:14.814000 89314 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55) 2025-05-07T20:33:14.8162961Z W0507 20:33:14.814000 89314 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240 2025-05-07T20:33:14.8164947Z W0507 20:33:14.814000 89314 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles". 2025-05-07T20:33:14.8166127Z W0507 20:33:14.814000 89314 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. 2025-05-07T20:33:15.2257511Z 2025-05-07T20:33:15.2257708Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:15.2258339Z self=, 2025-05-07T20:33:15.2258920Z T=1, 2025-05-07T20:33:15.2259172Z D=5120, 2025-05-07T20:33:15.2259437Z scale_ub=1200.0, 2025-05-07T20:33:15.2259741Z contiguous=True, 2025-05-07T20:33:15.2260042Z compiled=True, 2025-05-07T20:33:15.2260312Z ) 2025-05-07T20:33:15.2260694Z self = 2025-05-07T20:33:15.2261192Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:15.2261450Z 2025-05-07T20:33:15.2261533Z @given( 2025-05-07T20:33:15.2261768Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:15.2262079Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:15.2262374Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:15.2262686Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:15.2262996Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:15.2263263Z ) 2025-05-07T20:33:15.2263591Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:15.2264030Z def test_silu_mul_quant( 2025-05-07T20:33:15.2264258Z self, 2025-05-07T20:33:15.2264440Z T: int, 2025-05-07T20:33:15.2264634Z D: int, 2025-05-07T20:33:15.2264854Z scale_ub: Optional[float], 2025-05-07T20:33:15.2265115Z contiguous: bool, 2025-05-07T20:33:15.2265354Z compiled: bool, 2025-05-07T20:33:15.2265585Z ) -> None: 2025-05-07T20:33:15.2265916Z torch.manual_seed(2025) 2025-05-07T20:33:15.2266169Z 2025-05-07T20:33:15.2266449Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:15.2266791Z 2025-05-07T20:33:15.2266977Z x_sign = torch.sign(x) 2025-05-07T20:33:15.2267264Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:15.2267664Z x = x_sign * x_clamp 2025-05-07T20:33:15.2267893Z x0 = x[:, :D] 2025-05-07T20:33:15.2268104Z x1 = x[:, D:] 2025-05-07T20:33:15.2268310Z 2025-05-07T20:33:15.2268482Z if contiguous: 2025-05-07T20:33:15.2268712Z x0 = x0.contiguous() 2025-05-07T20:33:15.2268968Z x1 = x1.contiguous() 2025-05-07T20:33:15.2269198Z 2025-05-07T20:33:15.2269379Z if scale_ub is not None: 2025-05-07T20:33:15.2269648Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:15.2269965Z [scale_ub], device="cuda", dtype=torch.float32 
2025-05-07T20:33:15.2270269Z ) 2025-05-07T20:33:15.2270461Z else: 2025-05-07T20:33:15.2270778Z scale_ub_tensor = None 2025-05-07T20:33:15.2271022Z 2025-05-07T20:33:15.2271245Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:15.2271547Z op = silu_mul_quant 2025-05-07T20:33:15.2271781Z if compiled: 2025-05-07T20:33:15.2272020Z op = torch.compile(op) 2025-05-07T20:33:15.2272311Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.2272567Z 2025-05-07T20:33:15.2272749Z > y_fp8, y_scale = fn() 2025-05-07T20:33:15.2272906Z 2025-05-07T20:33:15.2273004Z moe/activation_test.py:117: 2025-05-07T20:33:15.2273282Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.2273605Z moe/activation_test.py:115: in fn 2025-05-07T20:33:15.2273953Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.2274494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:15.2275066Z return fn(*args, **kwargs) 2025-05-07T20:33:15.2275737Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:15.2276402Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:15.2276920Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:15.2277586Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:15.2278234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:15.2278747Z kernel = self.compile( 2025-05-07T20:33:15.2279287Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:15.2279930Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:15.2280319Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.2280545Z 2025-05-07T20:33:15.2280744Z self = 2025-05-07T20:33:15.2281802Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:15.2283152Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d8934680>} 2025-05-07T20:33:15.2284465Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:15.2285546Z context = 2025-05-07T20:33:15.2285828Z 2025-05-07T20:33:15.2285994Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:15.2286505Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:15.2286965Z module_map=module_map) 2025-05-07T20:33:15.2287322Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:15.2287667Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:15.2287917Z E ^ 2025-05-07T20:33:15.2288368Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:15.2288804Z 2025-05-07T20:33:15.2289213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:15.2289722Z 2025-05-07T20:33:15.2289818Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:15.2290223Z self=, 2025-05-07T20:33:15.2290664Z T=1, 2025-05-07T20:33:15.2290882Z D=5120, 2025-05-07T20:33:15.2291072Z scale_ub=None, 2025-05-07T20:33:15.2291286Z contiguous=False, 2025-05-07T20:33:15.2291499Z compiled=True, 2025-05-07T20:33:15.2291699Z ) 2025-05-07T20:33:15.2292004Z self = 2025-05-07T20:33:15.2292467Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:15.2292731Z 2025-05-07T20:33:15.2292806Z @given( 2025-05-07T20:33:15.2293025Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:15.2293319Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:15.2293617Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:15.2294008Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:15.2294326Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:15.2294592Z ) 2025-05-07T20:33:15.2294941Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:15.2295380Z def test_silu_mul_quant( 2025-05-07T20:33:15.2295610Z self, 2025-05-07T20:33:15.2295801Z T: int, 2025-05-07T20:33:15.2295995Z D: int, 2025-05-07T20:33:15.2296210Z scale_ub: Optional[float], 2025-05-07T20:33:15.2296475Z contiguous: bool, 2025-05-07T20:33:15.2296712Z compiled: bool, 2025-05-07T20:33:15.2296920Z ) -> None: 2025-05-07T20:33:15.2297125Z torch.manual_seed(2025) 2025-05-07T20:33:15.2297382Z 2025-05-07T20:33:15.2297650Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:15.2297978Z 2025-05-07T20:33:15.2298163Z x_sign = torch.sign(x) 2025-05-07T20:33:15.2298454Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:15.2298753Z x = x_sign * x_clamp 2025-05-07T20:33:15.2298994Z x0 = x[:, :D] 2025-05-07T20:33:15.2299208Z x1 = x[:, D:] 2025-05-07T20:33:15.2299413Z 2025-05-07T20:33:15.2299593Z if contiguous: 2025-05-07T20:33:15.2299814Z x0 = x0.contiguous() 2025-05-07T20:33:15.2305671Z x1 = x1.contiguous() 2025-05-07T20:33:15.2305942Z 2025-05-07T20:33:15.2306134Z if scale_ub is not None: 2025-05-07T20:33:15.2306409Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:15.2306736Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:15.2307038Z ) 2025-05-07T20:33:15.2307227Z else: 2025-05-07T20:33:15.2307482Z scale_ub_tensor = None 2025-05-07T20:33:15.2307731Z 2025-05-07T20:33:15.2307957Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:15.2308262Z op = silu_mul_quant 2025-05-07T20:33:15.2308513Z if compiled: 2025-05-07T20:33:15.2308749Z op = torch.compile(op) 2025-05-07T20:33:15.2309034Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.2309375Z 2025-05-07T20:33:15.2309565Z y_fp8, y_scale = fn() 2025-05-07T20:33:15.2309847Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:15.2310126Z 2025-05-07T20:33:15.2310356Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:15.2310680Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:15.2310963Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:15.2311270Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:15.2311617Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:15.2311910Z 2025-05-07T20:33:15.2312100Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:15.2312294Z 2025-05-07T20:33:15.2312389Z moe/activation_test.py:126: 2025-05-07T20:33:15.2312683Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.2313000Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:15.2313322Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:15.2314194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:15.2314919Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:15.2315460Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:15.2316132Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:15.2316818Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:15.2317522Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:15.2318285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:15.2318904Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:15.2319494Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:15.2319995Z fn() 2025-05-07T20:33:15.2320489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:15.2321052Z self.fn.run( 2025-05-07T20:33:15.2321503Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:15.2322018Z kernel = self.compile( 2025-05-07T20:33:15.2322563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:15.2323194Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:15.2323572Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.2323794Z 2025-05-07T20:33:15.2323995Z self = 2025-05-07T20:33:15.2325053Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:15.2326392Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d892ad40>} 2025-05-07T20:33:15.2327694Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:15.2328691Z context = 2025-05-07T20:33:15.2328980Z 2025-05-07T20:33:15.2329139Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:15.2329693Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:15.2330151Z module_map=module_map) 2025-05-07T20:33:15.2330504Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:15.2330853Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:15.2331102Z E ^ 2025-05-07T20:33:15.2331553Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:15.2331991Z 2025-05-07T20:33:15.2332398Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:15.3756715Z 2025-05-07T20:33:15.3757041Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:15.3757454Z self=, 2025-05-07T20:33:15.3757919Z T=1, 2025-05-07T20:33:15.3758186Z D=5120, 2025-05-07T20:33:15.3758447Z scale_ub=None, 2025-05-07T20:33:15.3758725Z contiguous=True, 2025-05-07T20:33:15.3759222Z compiled=False, 2025-05-07T20:33:15.3759432Z ) 2025-05-07T20:33:15.3759738Z self = 2025-05-07T20:33:15.3760212Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:15.3760474Z 2025-05-07T20:33:15.3760562Z @given( 2025-05-07T20:33:15.3760786Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:15.3761086Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:15.3761386Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:15.3761717Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:15.3762039Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:15.3762397Z ) 2025-05-07T20:33:15.3762781Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:15.3763239Z def test_silu_mul_quant( 2025-05-07T20:33:15.3763478Z self, 2025-05-07T20:33:15.3763665Z T: int, 2025-05-07T20:33:15.3763875Z D: int, 2025-05-07T20:33:15.3764088Z scale_ub: Optional[float], 2025-05-07T20:33:15.3764347Z contiguous: bool, 2025-05-07T20:33:15.3764586Z compiled: bool, 2025-05-07T20:33:15.3764805Z ) -> None: 2025-05-07T20:33:15.3765018Z torch.manual_seed(2025) 2025-05-07T20:33:15.3765267Z 2025-05-07T20:33:15.3765541Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:15.3765872Z 2025-05-07T20:33:15.3766067Z x_sign = torch.sign(x) 2025-05-07T20:33:15.3766353Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:15.3766650Z x = x_sign * x_clamp 2025-05-07T20:33:15.3766890Z x0 = x[:, :D] 2025-05-07T20:33:15.3767110Z x1 = x[:, D:] 2025-05-07T20:33:15.3767323Z 2025-05-07T20:33:15.3767504Z if contiguous: 2025-05-07T20:33:15.3767740Z x0 = x0.contiguous() 2025-05-07T20:33:15.3768006Z x1 = x1.contiguous() 2025-05-07T20:33:15.3768245Z 2025-05-07T20:33:15.3768446Z if scale_ub is not None: 2025-05-07T20:33:15.3768723Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:15.3769055Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:15.3769368Z ) 2025-05-07T20:33:15.3769564Z else: 2025-05-07T20:33:15.3769772Z scale_ub_tensor = None 2025-05-07T20:33:15.3770025Z 2025-05-07T20:33:15.3770266Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:15.3770569Z op = silu_mul_quant 2025-05-07T20:33:15.3770825Z if compiled: 2025-05-07T20:33:15.3771073Z op = torch.compile(op) 2025-05-07T20:33:15.3771362Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.3771636Z 2025-05-07T20:33:15.3771826Z > y_fp8, y_scale = fn() 2025-05-07T20:33:15.3771990Z 2025-05-07T20:33:15.3772165Z moe/activation_test.py:117: 2025-05-07T20:33:15.3772454Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.3772786Z moe/activation_test.py:115: in fn 2025-05-07T20:33:15.3773065Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.3773736Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:15.3774410Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:15.3774954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:15.3775639Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:15.3776286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:15.3776816Z kernel = self.compile( 2025-05-07T20:33:15.3777366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:15.3778090Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:15.3778480Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.3778712Z 2025-05-07T20:33:15.3778916Z self = 2025-05-07T20:33:15.3779970Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:15.3781316Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d9226660>} 2025-05-07T20:33:15.3782664Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:15.3783674Z context = 2025-05-07T20:33:15.3783958Z 2025-05-07T20:33:15.3784131Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:15.3784638Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:15.3785122Z module_map=module_map) 2025-05-07T20:33:15.3785505Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:15.3785858Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:15.3786107Z E ^ 2025-05-07T20:33:15.3786558Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:15.3787000Z 2025-05-07T20:33:15.3787522Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:15.3788023Z 2025-05-07T20:33:15.3788130Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:15.3788530Z self=, 2025-05-07T20:33:15.3788928Z T=128, 2025-05-07T20:33:15.3789110Z D=5120, 2025-05-07T20:33:15.3789290Z scale_ub=None, 2025-05-07T20:33:15.3789499Z contiguous=False, 2025-05-07T20:33:15.3789715Z compiled=True, 2025-05-07T20:33:15.3789905Z ) 2025-05-07T20:33:15.3790221Z self = 2025-05-07T20:33:15.3790701Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:15.3790959Z 2025-05-07T20:33:15.3791038Z @given( 2025-05-07T20:33:15.3791254Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:15.3791561Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:15.3791856Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:15.3792224Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:15.3792547Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:15.3792824Z ) 2025-05-07T20:33:15.3793157Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:15.3793592Z def test_silu_mul_quant( 2025-05-07T20:33:15.3793823Z self, 2025-05-07T20:33:15.3794005Z T: int, 2025-05-07T20:33:15.3794198Z D: int, 2025-05-07T20:33:15.3794407Z scale_ub: Optional[float], 2025-05-07T20:33:15.3794674Z contiguous: bool, 2025-05-07T20:33:15.3794906Z compiled: bool, 2025-05-07T20:33:15.3795127Z ) -> None: 2025-05-07T20:33:15.3795336Z torch.manual_seed(2025) 2025-05-07T20:33:15.3795568Z 2025-05-07T20:33:15.3795834Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:15.3796173Z 2025-05-07T20:33:15.3796354Z x_sign = torch.sign(x) 2025-05-07T20:33:15.3796644Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:15.3796994Z x = x_sign * x_clamp 2025-05-07T20:33:15.3797289Z x0 = x[:, :D] 2025-05-07T20:33:15.3797501Z x1 = x[:, D:] 2025-05-07T20:33:15.3797703Z 2025-05-07T20:33:15.3797876Z if contiguous: 2025-05-07T20:33:15.3798105Z x0 = x0.contiguous() 2025-05-07T20:33:15.3798357Z x1 = x1.contiguous() 2025-05-07T20:33:15.3798582Z 2025-05-07T20:33:15.3798767Z if scale_ub is not None: 2025-05-07T20:33:15.3799030Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:15.3799351Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:15.3799647Z ) 2025-05-07T20:33:15.3799835Z else: 2025-05-07T20:33:15.3800043Z scale_ub_tensor = None 2025-05-07T20:33:15.3800337Z 2025-05-07T20:33:15.3800556Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:15.3800858Z op = silu_mul_quant 2025-05-07T20:33:15.3801107Z if compiled: 2025-05-07T20:33:15.3801350Z op = torch.compile(op) 2025-05-07T20:33:15.3801638Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.3801906Z 2025-05-07T20:33:15.3802093Z > y_fp8, y_scale = fn() 2025-05-07T20:33:15.3802253Z 2025-05-07T20:33:15.3802349Z moe/activation_test.py:117: 2025-05-07T20:33:15.3802635Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.3802962Z moe/activation_test.py:115: in fn 2025-05-07T20:33:15.3803235Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.3803784Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:15.3804326Z return fn(*args, **kwargs) 
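Every CompilationError above shares one root cause: Triton cannot lower the fp8e4nv dtype (torch.float8_e4m3fn) on this runner's GPU, which only exposes fp8e4b15 and fp8e5. fp8e4nv kernels generally require an SM 8.9+ part (Ada/Hopper), while the A10G on a g5.4xlarge is SM 8.6. A minimal sketch of a capability guard that would skip these cases instead of erroring, assuming the unittest/hypothesis structure visible in the traces (the helper name and test body below are illustrative, not from the source):

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:
        # Triton's fp8e4nv corresponds to torch.float8_e4m3fn, which needs an
        # SM 8.9+ GPU (Ada / Hopper). The A10G on this g5.4xlarge is SM 8.6,
        # which is why the kernels above fail to compile.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)

    class Fp8GuardExample(unittest.TestCase):
        @unittest.skipUnless(supports_fp8e4nv(), "fp8e4nv needs an SM 8.9+ GPU")
        def test_fp8_cast(self) -> None:
            # Trivial fp8 round-trip, only attempted on supporting hardware.
            x = torch.zeros(4, device="cuda").to(torch.float8_e4m3fn)
            self.assertEqual(x.dtype, torch.float8_e4m3fn)

    if __name__ == "__main__":
        unittest.main()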
2025-05-07T20:33:15.3804983Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:15.3805819Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:15.3806478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:15.3807148Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:15.3807792Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:15.3808310Z kernel = self.compile( 2025-05-07T20:33:15.3808843Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:15.3809484Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:15.3809872Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.3810103Z 2025-05-07T20:33:15.3810309Z self = 2025-05-07T20:33:15.3811428Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:15.3812774Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d892bb00>} 2025-05-07T20:33:15.3814136Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:15.3815156Z context = 2025-05-07T20:33:15.3815467Z 2025-05-07T20:33:15.3815629Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:15.3816143Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:15.3816594Z module_map=module_map) 2025-05-07T20:33:15.3817038Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:15.3817492Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:15.3817747Z E ^ 2025-05-07T20:33:15.3818198Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:15.3818642Z 2025-05-07T20:33:15.3819060Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:15.3819560Z 2025-05-07T20:33:15.3819672Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:15.3820077Z self=, 2025-05-07T20:33:15.3820465Z T=128, 2025-05-07T20:33:15.3820718Z D=7168, 2025-05-07T20:33:15.3820917Z scale_ub=1200.0, 2025-05-07T20:33:15.3821149Z contiguous=False, 2025-05-07T20:33:15.3821388Z compiled=False, 2025-05-07T20:33:15.5399794Z ) 2025-05-07T20:33:15.5400372Z self = 2025-05-07T20:33:15.5400936Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:15.5401214Z 2025-05-07T20:33:15.5401302Z @given( 2025-05-07T20:33:15.5401535Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:15.5401862Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:15.5402160Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:15.5402483Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:15.5402800Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:15.5403079Z ) 2025-05-07T20:33:15.5403414Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:15.5403868Z def test_silu_mul_quant( 2025-05-07T20:33:15.5404107Z self, 2025-05-07T20:33:15.5404295Z T: int, 2025-05-07T20:33:15.5404481Z D: int, 2025-05-07T20:33:15.5404689Z scale_ub: Optional[float], 2025-05-07T20:33:15.5404945Z contiguous: bool, 2025-05-07T20:33:15.5405181Z compiled: bool, 2025-05-07T20:33:15.5405398Z ) -> None: 2025-05-07T20:33:15.5405597Z torch.manual_seed(2025) 2025-05-07T20:33:15.5405832Z 2025-05-07T20:33:15.5406096Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:15.5406428Z 2025-05-07T20:33:15.5406618Z x_sign = torch.sign(x) 2025-05-07T20:33:15.5406935Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:15.5407235Z x = x_sign * x_clamp 2025-05-07T20:33:15.5407459Z x0 = x[:, :D] 2025-05-07T20:33:15.5407658Z x1 = x[:, D:] 2025-05-07T20:33:15.5407845Z 2025-05-07T20:33:15.5408014Z if contiguous: 2025-05-07T20:33:15.5408228Z x0 = x0.contiguous() 2025-05-07T20:33:15.5408472Z x1 = x1.contiguous() 2025-05-07T20:33:15.5408693Z 2025-05-07T20:33:15.5408989Z if scale_ub is not None: 2025-05-07T20:33:15.5409262Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:15.5409590Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:15.5409887Z ) 2025-05-07T20:33:15.5410064Z else: 2025-05-07T20:33:15.5410263Z scale_ub_tensor = None 2025-05-07T20:33:15.5410505Z 2025-05-07T20:33:15.5410722Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:15.5411028Z op = silu_mul_quant 2025-05-07T20:33:15.5411268Z if compiled: 2025-05-07T20:33:15.5411502Z op = torch.compile(op) 2025-05-07T20:33:15.5411792Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.5412059Z 2025-05-07T20:33:15.5412234Z > y_fp8, y_scale = fn() 2025-05-07T20:33:15.5412400Z 2025-05-07T20:33:15.5412495Z moe/activation_test.py:117: 2025-05-07T20:33:15.5412776Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.5413102Z moe/activation_test.py:115: in fn 2025-05-07T20:33:15.5413537Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.5414219Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:15.5414892Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:15.5415418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:15.5416091Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:15.5416740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:15.5417260Z kernel = self.compile( 2025-05-07T20:33:15.5417796Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:15.5418510Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:15.5418909Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.5419135Z 2025-05-07T20:33:15.5419336Z self = 2025-05-07T20:33:15.5420398Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:15.5421762Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d8d66200>} 2025-05-07T20:33:15.5423077Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:15.5424086Z context = 2025-05-07T20:33:15.5424368Z 2025-05-07T20:33:15.5424532Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:15.5425063Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:15.5425547Z module_map=module_map) 2025-05-07T20:33:15.5425901Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:15.5426243Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:15.5426489Z E ^ 2025-05-07T20:33:15.5426935Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:15.5427372Z 2025-05-07T20:33:15.5427847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:15.5428354Z 2025-05-07T20:33:15.5428450Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:15.5428895Z self=, 2025-05-07T20:33:15.5429297Z T=128, 2025-05-07T20:33:15.5429472Z D=5120, 2025-05-07T20:33:15.5429653Z scale_ub=None, 2025-05-07T20:33:15.5429859Z contiguous=False, 2025-05-07T20:33:15.5430069Z compiled=False, 2025-05-07T20:33:15.5430265Z ) 2025-05-07T20:33:15.5430570Z self = 2025-05-07T20:33:15.5431040Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:15.5431302Z 2025-05-07T20:33:15.5431372Z @given( 2025-05-07T20:33:15.5431591Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:15.5431895Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:15.5432186Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:15.5432515Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:15.5432829Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:15.5433099Z ) 2025-05-07T20:33:15.5433576Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:15.5434010Z def test_silu_mul_quant( 2025-05-07T20:33:15.5434230Z self, 2025-05-07T20:33:15.5434411Z T: int, 2025-05-07T20:33:15.5434594Z D: int, 2025-05-07T20:33:15.5434796Z scale_ub: Optional[float], 2025-05-07T20:33:15.5435055Z contiguous: bool, 2025-05-07T20:33:15.5435284Z compiled: bool, 2025-05-07T20:33:15.5435489Z ) -> None: 2025-05-07T20:33:15.5435690Z torch.manual_seed(2025) 2025-05-07T20:33:15.5435917Z 2025-05-07T20:33:15.5436174Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:15.5436501Z 2025-05-07T20:33:15.5436680Z x_sign = torch.sign(x) 2025-05-07T20:33:15.5437002Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:15.5437296Z x = x_sign * x_clamp 2025-05-07T20:33:15.5437521Z x0 = x[:, :D] 2025-05-07T20:33:15.5437720Z x1 = x[:, D:] 2025-05-07T20:33:15.5437911Z 2025-05-07T20:33:15.5438074Z if contiguous: 2025-05-07T20:33:15.5438291Z x0 = x0.contiguous() 2025-05-07T20:33:15.5438535Z x1 = x1.contiguous() 2025-05-07T20:33:15.5438763Z 2025-05-07T20:33:15.5438943Z if scale_ub is not None: 2025-05-07T20:33:15.5439194Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:15.5439519Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:15.5439823Z ) 2025-05-07T20:33:15.5440002Z else: 2025-05-07T20:33:15.5440619Z scale_ub_tensor = None 2025-05-07T20:33:15.5440865Z 2025-05-07T20:33:15.5441083Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:15.5441384Z op = silu_mul_quant 2025-05-07T20:33:15.5441620Z if compiled: 2025-05-07T20:33:15.5441854Z op = torch.compile(op) 2025-05-07T20:33:15.5442141Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.5442401Z 2025-05-07T20:33:15.5442585Z > y_fp8, y_scale = fn() 2025-05-07T20:33:15.5442746Z 2025-05-07T20:33:15.5442846Z moe/activation_test.py:117: 2025-05-07T20:33:15.5443136Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.5443456Z moe/activation_test.py:115: in fn 2025-05-07T20:33:15.5443721Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.5444402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:15.5445073Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:15.5445605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:15.5446286Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:15.5447018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:15.5447540Z kernel = self.compile( 2025-05-07T20:33:15.5448077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:15.5454474Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:15.5454874Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.5455127Z 2025-05-07T20:33:15.5455359Z self = 2025-05-07T20:33:15.5456423Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:15.5457789Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d8935940>} 2025-05-07T20:33:15.5459304Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:15.5460304Z context = 2025-05-07T20:33:15.5460583Z 2025-05-07T20:33:15.5460748Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:15.5461252Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:15.5461706Z module_map=module_map) 2025-05-07T20:33:15.5462058Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:15.5462475Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:15.5462732Z E ^ 2025-05-07T20:33:15.5463192Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:15.5463639Z 2025-05-07T20:33:15.5464075Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:15.5464575Z 2025-05-07T20:33:15.5464675Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:15.5465073Z self=, 2025-05-07T20:33:15.5465461Z T=128, 2025-05-07T20:33:15.5465632Z D=5120, 2025-05-07T20:33:15.5465811Z scale_ub=1200.0, 2025-05-07T20:33:15.5466022Z contiguous=True, 2025-05-07T20:33:15.5466225Z compiled=False, 2025-05-07T20:33:15.5466421Z ) 2025-05-07T20:33:15.5466723Z self = 2025-05-07T20:33:15.5467201Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:15.5467542Z 2025-05-07T20:33:15.5467618Z @given( 2025-05-07T20:33:15.5467842Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:15.5468150Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:15.5468444Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:15.5468766Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:15.5469087Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:15.5469358Z ) 2025-05-07T20:33:15.5469695Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:15.5470121Z def test_silu_mul_quant( 2025-05-07T20:33:15.5470352Z self, 2025-05-07T20:33:15.5470529Z T: int, 2025-05-07T20:33:15.5470712Z D: int, 2025-05-07T20:33:15.5470916Z scale_ub: Optional[float], 2025-05-07T20:33:15.5471167Z contiguous: bool, 2025-05-07T20:33:15.5471393Z compiled: bool, 2025-05-07T20:33:15.5471612Z ) -> None: 2025-05-07T20:33:15.5471809Z torch.manual_seed(2025) 2025-05-07T20:33:15.5472046Z 2025-05-07T20:33:15.5472354Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:15.5472685Z 2025-05-07T20:33:15.5472868Z x_sign = torch.sign(x) 2025-05-07T20:33:15.5473146Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:15.5473435Z x = x_sign * x_clamp 2025-05-07T20:33:15.5473660Z x0 = x[:, :D] 2025-05-07T20:33:15.5473864Z x1 = x[:, D:] 2025-05-07T20:33:15.5474052Z 2025-05-07T20:33:15.5474221Z if contiguous: 2025-05-07T20:33:15.5474436Z x0 = x0.contiguous() 2025-05-07T20:33:15.5474675Z x1 = x1.contiguous() 2025-05-07T20:33:15.5474904Z 2025-05-07T20:33:15.5475081Z if scale_ub is not None: 2025-05-07T20:33:15.5475343Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:15.5475663Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:15.5475959Z ) 2025-05-07T20:33:15.5476136Z else: 2025-05-07T20:33:15.5476330Z scale_ub_tensor = None 2025-05-07T20:33:15.5476567Z 2025-05-07T20:33:15.5476895Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:15.5477192Z op = silu_mul_quant 2025-05-07T20:33:15.5477429Z if compiled: 2025-05-07T20:33:15.5477665Z op = torch.compile(op) 2025-05-07T20:33:15.5477941Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.5478195Z 2025-05-07T20:33:15.5478372Z > y_fp8, y_scale = fn() 2025-05-07T20:33:15.5478529Z 2025-05-07T20:33:15.5478619Z moe/activation_test.py:117: 2025-05-07T20:33:15.5478898Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.5479213Z moe/activation_test.py:115: in fn 2025-05-07T20:33:15.5479477Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.5480229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:15.5480901Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:15.5481425Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:15.5482098Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:15.5482751Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:15.5483273Z kernel = self.compile( 2025-05-07T20:33:15.5483811Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:15.5484450Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:15.5484826Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.5485066Z 2025-05-07T20:33:15.5485304Z self = 2025-05-07T20:33:15.5486366Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:15.5487706Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d872cc20>} 2025-05-07T20:33:15.5489055Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:15.5490209Z context = 2025-05-07T20:33:15.5490595Z 2025-05-07T20:33:15.5490816Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:15.5491488Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:15.5492038Z module_map=module_map) 2025-05-07T20:33:15.5492401Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:15.5492750Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:15.5492990Z E ^ 2025-05-07T20:33:15.5493442Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:15.5493877Z 2025-05-07T20:33:15.5494293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:15.7072475Z 2025-05-07T20:33:15.7072860Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:15.7073380Z self=, 2025-05-07T20:33:15.7073966Z T=1, 2025-05-07T20:33:15.7074230Z D=7168, 2025-05-07T20:33:15.7074479Z scale_ub=1200.0, 2025-05-07T20:33:15.7074799Z contiguous=True, 2025-05-07T20:33:15.7075070Z compiled=True, 2025-05-07T20:33:15.7075327Z ) 2025-05-07T20:33:15.7075819Z self = 2025-05-07T20:33:15.7076373Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:15.7076648Z 2025-05-07T20:33:15.7076724Z @given( 2025-05-07T20:33:15.7076953Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:15.7077259Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:15.7077560Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:15.7077883Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:15.7078204Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:15.7078479Z ) 2025-05-07T20:33:15.7078826Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:15.7079336Z def test_silu_mul_quant( 2025-05-07T20:33:15.7079575Z self, 2025-05-07T20:33:15.7079756Z T: int, 2025-05-07T20:33:15.7079947Z D: int, 2025-05-07T20:33:15.7080166Z scale_ub: Optional[float], 2025-05-07T20:33:15.7080431Z contiguous: bool, 2025-05-07T20:33:15.7080666Z compiled: bool, 2025-05-07T20:33:15.7080889Z ) -> None: 2025-05-07T20:33:15.7081094Z torch.manual_seed(2025) 2025-05-07T20:33:15.7081335Z 2025-05-07T20:33:15.7081602Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:15.7081925Z 2025-05-07T20:33:15.7082105Z x_sign = torch.sign(x) 2025-05-07T20:33:15.7082383Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:15.7082673Z x = x_sign * x_clamp 2025-05-07T20:33:15.7082910Z x0 = x[:, :D] 2025-05-07T20:33:15.7083119Z x1 = x[:, D:] 2025-05-07T20:33:15.7083321Z 2025-05-07T20:33:15.7083501Z if contiguous: 2025-05-07T20:33:15.7083730Z x0 = x0.contiguous() 2025-05-07T20:33:15.7083981Z x1 = x1.contiguous() 2025-05-07T20:33:15.7084206Z 2025-05-07T20:33:15.7084391Z if scale_ub is not None: 2025-05-07T20:33:15.7084658Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:15.7084974Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:15.7085276Z ) 2025-05-07T20:33:15.7085461Z else: 2025-05-07T20:33:15.7085653Z scale_ub_tensor = None 2025-05-07T20:33:15.7085893Z 2025-05-07T20:33:15.7086110Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:15.7086405Z op = silu_mul_quant 2025-05-07T20:33:15.7086639Z if compiled: 2025-05-07T20:33:15.7086876Z op = torch.compile(op) 2025-05-07T20:33:15.7087153Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.7087414Z 2025-05-07T20:33:15.7087602Z > y_fp8, y_scale = fn() 2025-05-07T20:33:15.7087762Z 2025-05-07T20:33:15.7087859Z moe/activation_test.py:117: 2025-05-07T20:33:15.7088137Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.7088540Z moe/activation_test.py:115: in fn 2025-05-07T20:33:15.7088812Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.7089359Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:15.7089902Z return fn(*args, **kwargs) 
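For reference, the eager math in ref_fn above is y = x0 * sigmoid(x0) * x1 followed by row-wise fp8 quantization. A pure-PyTorch sketch of that pipeline, which sidesteps the Triton fp8e4nv lowering entirely; the exact scale_ub handling and epsilon of triton_quantize_fp8_row are assumptions here, not taken from the source:

    from typing import Optional, Tuple
    import torch

    def silu_mul_quant_reference(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # silu(x0) * x1 in fp32, exactly as in the test's ref_fn.
        y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
        # Row-wise scale: per-row absmax mapped onto the fp8 e4m3 range.
        row_max = y.abs().amax(dim=-1)
        if scale_ub is not None:
            # Assumed semantics: scale_ub caps the per-row max.
            row_max = torch.minimum(row_max, scale_ub.to(row_max.dtype))
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        scale = row_max.clamp_min(1e-12) / fp8_max
        y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

    # Dequantization then matches the test's check:
    # y ~= y_fp8.to(torch.float32) * scale[:, None]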
2025-05-07T20:33:15.7090555Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:15.7091211Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:15.7091741Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:15.7092401Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:15.7093045Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:15.7093558Z kernel = self.compile( 2025-05-07T20:33:15.7094149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:15.7094834Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:15.7095225Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.7095443Z 2025-05-07T20:33:15.7095642Z self = 2025-05-07T20:33:15.7096720Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:15.7098085Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d872dee0>} 2025-05-07T20:33:15.7099476Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:15.7100473Z context = 2025-05-07T20:33:15.7100755Z 2025-05-07T20:33:15.7100913Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:15.7101419Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:15.7101873Z module_map=module_map) 2025-05-07T20:33:15.7102222Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:15.7102576Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:15.7102826Z E ^ 2025-05-07T20:33:15.7103269Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:15.7103716Z 2025-05-07T20:33:15.7104143Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:15.7104652Z 2025-05-07T20:33:15.7104748Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:15.7105168Z self=, 2025-05-07T20:33:15.7105581Z T=1, 2025-05-07T20:33:15.7105754Z D=7168, 2025-05-07T20:33:15.7105940Z scale_ub=1200.0, 2025-05-07T20:33:15.7106162Z contiguous=False, 2025-05-07T20:33:15.7106372Z compiled=True, 2025-05-07T20:33:15.7106557Z ) 2025-05-07T20:33:15.7106863Z self = 2025-05-07T20:33:15.7107335Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:15.7107657Z 2025-05-07T20:33:15.7107727Z @given( 2025-05-07T20:33:15.7107937Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:15.7108231Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:15.7108585Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:15.7108908Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:15.7109222Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:15.7109499Z ) 2025-05-07T20:33:15.7109831Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:15.7110267Z def test_silu_mul_quant( 2025-05-07T20:33:15.7110497Z self, 2025-05-07T20:33:15.7110682Z T: int, 2025-05-07T20:33:15.7110861Z D: int, 2025-05-07T20:33:15.7111076Z scale_ub: Optional[float], 2025-05-07T20:33:15.7111338Z contiguous: bool, 2025-05-07T20:33:15.7111563Z compiled: bool, 2025-05-07T20:33:15.7111774Z ) -> None: 2025-05-07T20:33:15.7111978Z torch.manual_seed(2025) 2025-05-07T20:33:15.7112210Z 2025-05-07T20:33:15.7112472Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:15.7112798Z 2025-05-07T20:33:15.7112983Z x_sign = torch.sign(x) 2025-05-07T20:33:15.7113321Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:15.7113659Z x = x_sign * x_clamp 2025-05-07T20:33:15.7113886Z x0 = x[:, :D] 2025-05-07T20:33:15.7114095Z x1 = x[:, D:] 2025-05-07T20:33:15.7114288Z 2025-05-07T20:33:15.7114456Z if contiguous: 2025-05-07T20:33:15.7114674Z x0 = x0.contiguous() 2025-05-07T20:33:15.7114919Z x1 = x1.contiguous() 2025-05-07T20:33:15.7115140Z 2025-05-07T20:33:15.7115321Z if scale_ub is not None: 2025-05-07T20:33:15.7115583Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:15.7115911Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:15.7116200Z ) 2025-05-07T20:33:15.7116387Z else: 2025-05-07T20:33:15.7116634Z scale_ub_tensor = None 2025-05-07T20:33:15.7116870Z 2025-05-07T20:33:15.7117091Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:15.7117391Z op = silu_mul_quant 2025-05-07T20:33:15.7117627Z if compiled: 2025-05-07T20:33:15.7117862Z op = torch.compile(op) 2025-05-07T20:33:15.7118152Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.7118410Z 2025-05-07T20:33:15.7118588Z > y_fp8, y_scale = fn() 2025-05-07T20:33:15.7118746Z 2025-05-07T20:33:15.7118846Z moe/activation_test.py:117: 2025-05-07T20:33:15.7119126Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.7119458Z moe/activation_test.py:115: in fn 2025-05-07T20:33:15.7119729Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:15.7120282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:15.7120826Z return fn(*args, **kwargs) 
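Separate from the compilation failures, the recompile_limit warning earlier in the log shows torch.compile specializing silu_mul_quant once per stride/contiguity combination of x0/x1 until Dynamo gives up and falls back to eager. A short sketch of the standard torch.compile knobs for this, following the warning's own suggestion (the silu_mul_quant import path is inferred from the traceback; the limit value is illustrative):

    import torch
    import torch._dynamo
    # Module path taken from the traceback above.
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    # Option 1: allow more specializations before falling back to eager.
    torch._dynamo.config.recompile_limit = 16

    # Option 2: compile with dynamic shapes so new T / stride combinations
    # can reuse the same graph instead of triggering a recompile.
    op = torch.compile(silu_mul_quant, dynamic=True)

    # As the warning notes, running the test with TORCH_LOGS="recompiles"
    # prints the guard that failed for each recompilation.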
2025-05-07T20:33:15.7121474Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:15.7122145Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:15.7122673Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:15.7123362Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:15.7124015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:15.7124539Z kernel = self.compile( 2025-05-07T20:33:15.7125077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:15.7125713Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:15.7126105Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:15.7126328Z 2025-05-07T20:33:15.7126537Z self = 2025-05-07T20:33:15.7127640Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:15.7129031Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d872ec00>} 2025-05-07T20:33:15.7130344Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:15.7131391Z context = 2025-05-07T20:33:15.7131667Z 2025-05-07T20:33:15.7131824Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:15.7132331Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:15.7132870Z module_map=module_map) 2025-05-07T20:33:15.7133221Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:15.7133554Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:15.7133798Z E ^ 2025-05-07T20:33:15.7134245Z E ValueError("type fp8e4nv not supported in this architecture. 
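Note: every failure in this excerpt has the same root cause. Triton's fp8e4nv type (PyTorch's torch.float8_e4m3fn) lowers to native FP8 only on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper); the A10G in this linux.g5.4xlarge runner reports (8, 6), so both FBGEMM Triton kernels die at compile time. A minimal sketch of a capability gate that would skip these cases rather than fail them; the helper name supports_fp8e4nv is hypothetical, not part of activation_test.py:

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # fp8e4nv (float8_e4m3fn) needs SM 8.9+, e.g. L4/L40S/H100;
    # torch.cuda.get_device_capability() returns (8, 6) for the A10G here.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


# Applied to the test above (sketch):
#
#   @unittest.skipIf(not supports_fp8e4nv(), "FP8 e4m3 requires SM 8.9+")
#   def test_silu_mul_quant(self, ...): ...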
2025-05-07T20:33:15.9244484Z Trying example: test_silu_mul_quant(
    self=<…>,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = <…>
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

[test body identical to the first example above; elided. This time fn() returned and the failure moved to the reference path:]

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[make_ir locals elided; options here use num_stages=2]
    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _kernel_quantize_fp8_row(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
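For context on the reference path that fails here: ref_fn computes the SiLU product in fp32 and then quantizes row-wise to FP8. A rough eager-mode sketch of that quantization step, assuming the per-row max-abs scheme implied by the test's dequantization (y_fp8.to(torch.float32) * y_scale[:, None]); this illustrates the contract, it is not FBGEMM's triton_quantize_fp8_row:

from typing import Optional, Tuple

import torch


def quantize_fp8_row_sketch(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Scale each row so its max-abs value maps to the FP8 e4m3 finite max (448.0).
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    row_max = y.abs().amax(dim=1).float()
    if scale_ub is not None:
        # Assumed semantics: scale_ub caps the per-row max before scaling.
        row_max = torch.minimum(row_max, scale_ub)
    scale = torch.clamp(row_max, min=1e-12) / fp8_max
    y_fp8 = torch.clamp(y.float() / scale[:, None], -fp8_max, fp8_max).to(
        torch.float8_e4m3fn
    )
    return y_fp8, scale  # dequantize as y_fp8.to(torch.float32) * scale[:, None]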
[from here on, each remaining example fails with the identical CompilationError; the repeated test-source listings and tracebacks are elided]

2025-05-07T20:33:15.9282505Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
    -> fn() fails compiling _fbgemm_silu_mul_quant: fp8e4nv not supported in this architecture
2025-05-07T20:33:16.0739856Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
    -> fn() fails identically in eager mode (moe/activation_test.py:115 -> activation.py:80 directly, no dynamo frame)
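Aside on log volume: the full source and traceback appear to be echoed for every attempted example because the test runs under @settings(verbosity=Verbosity.verbose, ...). At the default verbosity, Hypothesis reports only the final falsifying example, which would shrink this log considerably. A sketch:

from hypothesis import Verbosity, given, settings, strategies as st


@settings(verbosity=Verbosity.normal, deadline=None)  # default: no per-example echo
@given(T=st.sampled_from([1, 128, 2048, 4096, 16384]))
def test_quiet_example(T: int) -> None:
    assert T in (1, 128, 2048, 4096, 16384)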
2025-05-07T20:33:16.0772684Z Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
    -> fn() fails compiling _fbgemm_silu_mul_quant: same error
2025-05-07T20:33:16.0804906Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
    -> fn() fails compiling _fbgemm_silu_mul_quant: same error
2025-05-07T20:33:16.2756943Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
    -> fn() fails compiling _fbgemm_silu_mul_quant: same error (eager)
2025-05-07T20:33:16.2820373Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
    -> fn() fails compiling _fbgemm_silu_mul_quant: same error (eager)
2025-05-07T20:33:16.4441701Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
    -> fn() fails compiling _fbgemm_silu_mul_quant: same error (contiguous inputs make no difference)
2025-05-07T20:33:16.4493274Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
    -> fn() fails compiling _fbgemm_silu_mul_quant: same error
2025-05-07T20:33:16.5940676Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
    -> fn() fails compiling _fbgemm_silu_mul_quant: same error (eager)
2025-05-07T20:33:16.5972111Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
    -> fn() fails through torch.compile (dynamo eval_frame); the traceback is cut off at the end of this excerpt
2025-05-07T20:33:16.5996057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:16.5996728Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:16.5997259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:16.5997917Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:16.5998569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:16.5999091Z kernel = self.compile( 2025-05-07T20:33:16.5999670Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:16.6000317Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:16.6000706Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.6000926Z 2025-05-07T20:33:16.6001133Z self = 2025-05-07T20:33:16.6002187Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:16.6003542Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d9a76fc0>} 2025-05-07T20:33:16.6004866Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:16.6006006Z context = 2025-05-07T20:33:16.6006289Z 2025-05-07T20:33:16.6006455Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:16.6006957Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:16.6007421Z module_map=module_map) 2025-05-07T20:33:16.6007776Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:16.6008119Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:16.6008380Z E ^ 2025-05-07T20:33:16.6008839Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:16.6009325Z 2025-05-07T20:33:16.6009754Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:16.6010273Z 2025-05-07T20:33:16.6010380Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:16.6010784Z self=, 2025-05-07T20:33:16.6011181Z T=2048, 2025-05-07T20:33:16.6011360Z D=7168, 2025-05-07T20:33:16.6011546Z scale_ub=1200.0, 2025-05-07T20:33:16.6011766Z contiguous=False, 2025-05-07T20:33:16.6011987Z compiled=False, 2025-05-07T20:33:16.7977306Z ) 2025-05-07T20:33:16.7977726Z self = 2025-05-07T20:33:16.7978428Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:16.7978811Z 2025-05-07T20:33:16.7978915Z @given( 2025-05-07T20:33:16.7979221Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:16.7979653Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:16.7980056Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:16.7980476Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:16.7980852Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:16.7981137Z ) 2025-05-07T20:33:16.7981486Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:16.7981916Z def test_silu_mul_quant( 2025-05-07T20:33:16.7982156Z self, 2025-05-07T20:33:16.7982352Z T: int, 2025-05-07T20:33:16.7982548Z D: int, 2025-05-07T20:33:16.7982765Z scale_ub: Optional[float], 2025-05-07T20:33:16.7983036Z contiguous: bool, 2025-05-07T20:33:16.7983274Z compiled: bool, 2025-05-07T20:33:16.7983488Z ) -> None: 2025-05-07T20:33:16.7983697Z torch.manual_seed(2025) 2025-05-07T20:33:16.7983940Z 2025-05-07T20:33:16.7984204Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:16.7984551Z 2025-05-07T20:33:16.7984750Z x_sign = torch.sign(x) 2025-05-07T20:33:16.7985152Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:16.7985473Z x = x_sign * x_clamp 2025-05-07T20:33:16.7985709Z x0 = x[:, :D] 2025-05-07T20:33:16.7985913Z x1 = x[:, D:] 2025-05-07T20:33:16.7986118Z 2025-05-07T20:33:16.7986302Z if contiguous: 2025-05-07T20:33:16.7986529Z x0 = x0.contiguous() 2025-05-07T20:33:16.7986785Z x1 = x1.contiguous() 2025-05-07T20:33:16.7987022Z 2025-05-07T20:33:16.7987201Z if scale_ub is not None: 2025-05-07T20:33:16.7987537Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:16.7987877Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:16.7988185Z ) 2025-05-07T20:33:16.7988369Z else: 2025-05-07T20:33:16.7988581Z scale_ub_tensor = None 2025-05-07T20:33:16.7988828Z 2025-05-07T20:33:16.7989055Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:16.7989367Z op = silu_mul_quant 2025-05-07T20:33:16.7989624Z if compiled: 2025-05-07T20:33:16.7989989Z op = torch.compile(op) 2025-05-07T20:33:16.7990287Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.7990562Z 2025-05-07T20:33:16.7990745Z > y_fp8, y_scale = fn() 2025-05-07T20:33:16.7990917Z 2025-05-07T20:33:16.7991016Z moe/activation_test.py:117: 2025-05-07T20:33:16.7991306Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.7991628Z moe/activation_test.py:115: in fn 2025-05-07T20:33:16.7991905Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.7992592Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:16.7993274Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:16.7993867Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:16.7994547Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:16.7995210Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:16.7995732Z kernel = self.compile( 2025-05-07T20:33:16.7996262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:16.7996906Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:16.7997302Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.7997525Z 2025-05-07T20:33:16.7997725Z self = 2025-05-07T20:33:16.7998790Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:16.8000148Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d9a77ec0>} 2025-05-07T20:33:16.8001460Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:16.8002463Z context = 2025-05-07T20:33:16.8002747Z 2025-05-07T20:33:16.8002908Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:16.8003429Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:16.8003895Z module_map=module_map) 2025-05-07T20:33:16.8004252Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:16.8004605Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:16.8004903Z E ^ 2025-05-07T20:33:16.8005374Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:16.8005812Z 2025-05-07T20:33:16.8006227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:16.8006736Z 2025-05-07T20:33:16.8006833Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:16.8007235Z self=, 2025-05-07T20:33:16.8007625Z T=1, 2025-05-07T20:33:16.8007801Z D=7168, 2025-05-07T20:33:16.8007994Z scale_ub=None, 2025-05-07T20:33:16.8008209Z contiguous=True, 2025-05-07T20:33:16.8008425Z compiled=False, 2025-05-07T20:33:16.8008624Z ) 2025-05-07T20:33:16.8008939Z self = 2025-05-07T20:33:16.8009414Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:16.8009683Z 2025-05-07T20:33:16.8009807Z @given( 2025-05-07T20:33:16.8010073Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:16.8010379Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:16.8010680Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:16.8011007Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:16.8011332Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:16.8011607Z ) 2025-05-07T20:33:16.8011956Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:16.8012403Z def test_silu_mul_quant( 2025-05-07T20:33:16.8012639Z self, 2025-05-07T20:33:16.8012839Z T: int, 2025-05-07T20:33:16.8013034Z D: int, 2025-05-07T20:33:16.8013321Z scale_ub: Optional[float], 2025-05-07T20:33:16.8013588Z contiguous: bool, 2025-05-07T20:33:16.8013828Z compiled: bool, 2025-05-07T20:33:16.8014045Z ) -> None: 2025-05-07T20:33:16.8014262Z torch.manual_seed(2025) 2025-05-07T20:33:16.8014512Z 2025-05-07T20:33:16.8014786Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:16.8015127Z 2025-05-07T20:33:16.8015313Z x_sign = torch.sign(x) 2025-05-07T20:33:16.8015611Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:16.8015920Z x = x_sign * x_clamp 2025-05-07T20:33:16.8016154Z x0 = x[:, :D] 2025-05-07T20:33:16.8016375Z x1 = x[:, D:] 2025-05-07T20:33:16.8016583Z 2025-05-07T20:33:16.8016761Z if contiguous: 2025-05-07T20:33:16.8016989Z x0 = x0.contiguous() 2025-05-07T20:33:16.8017243Z x1 = x1.contiguous() 2025-05-07T20:33:16.8017480Z 2025-05-07T20:33:16.8017666Z if scale_ub is not None: 2025-05-07T20:33:16.8017941Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:16.8018275Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:16.8018581Z ) 2025-05-07T20:33:16.8018780Z else: 2025-05-07T20:33:16.8018994Z scale_ub_tensor = None 2025-05-07T20:33:16.8019237Z 2025-05-07T20:33:16.8019460Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:16.8019768Z op = silu_mul_quant 2025-05-07T20:33:16.8020009Z if compiled: 2025-05-07T20:33:16.8020264Z op = torch.compile(op) 2025-05-07T20:33:16.8020554Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.8020819Z 2025-05-07T20:33:16.8021006Z > y_fp8, y_scale = fn() 2025-05-07T20:33:16.8021165Z 2025-05-07T20:33:16.8021267Z moe/activation_test.py:117: 2025-05-07T20:33:16.8021559Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.8021880Z moe/activation_test.py:115: in fn 2025-05-07T20:33:16.8022156Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.8022883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:16.8023563Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:16.8024114Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:16.8024797Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:16.8025496Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:16.8026033Z kernel = self.compile( 2025-05-07T20:33:16.8026573Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:16.8027219Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:16.8027655Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.8027882Z 2025-05-07T20:33:16.8028089Z self = 2025-05-07T20:33:16.8029232Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:16.8030583Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d88d4cc0>} 2025-05-07T20:33:16.8031900Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:16.8032952Z context = 2025-05-07T20:33:16.8033277Z 2025-05-07T20:33:16.8033443Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:16.8033962Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:16.8034440Z module_map=module_map) 2025-05-07T20:33:16.8034796Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:16.8035163Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:16.8035422Z E ^ 2025-05-07T20:33:16.8035879Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:16.8036328Z 2025-05-07T20:33:16.8036746Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:16.8037258Z 2025-05-07T20:33:16.8037361Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:16.8037772Z self=, 2025-05-07T20:33:16.8038175Z T=16384, 2025-05-07T20:33:16.8038369Z D=7168, 2025-05-07T20:33:16.8038578Z scale_ub=1200.0, 2025-05-07T20:33:16.8038795Z contiguous=False, 2025-05-07T20:33:16.8039026Z compiled=True, 2025-05-07T20:33:16.8039232Z ) 2025-05-07T20:33:16.8039540Z self = 2025-05-07T20:33:16.8040030Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:16.8040506Z 2025-05-07T20:33:16.8040595Z @given( 2025-05-07T20:33:16.8040821Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:16.8041127Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:16.8041428Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:16.8041756Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:16.8042072Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:16.8042356Z ) 2025-05-07T20:33:16.8042706Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:16.8043206Z def test_silu_mul_quant( 2025-05-07T20:33:16.8043445Z self, 2025-05-07T20:33:16.8043643Z T: int, 2025-05-07T20:33:16.8043832Z D: int, 2025-05-07T20:33:16.8044048Z scale_ub: Optional[float], 2025-05-07T20:33:16.8044316Z contiguous: bool, 2025-05-07T20:33:16.8044546Z compiled: bool, 2025-05-07T20:33:16.8044772Z ) -> None: 2025-05-07T20:33:16.8044982Z torch.manual_seed(2025) 2025-05-07T20:33:16.8045219Z 2025-05-07T20:33:16.8045477Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:16.8045811Z 2025-05-07T20:33:16.8046001Z x_sign = torch.sign(x) 2025-05-07T20:33:16.8046280Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:16.8046583Z x = x_sign * x_clamp 2025-05-07T20:33:16.8046818Z x0 = x[:, :D] 2025-05-07T20:33:16.8047032Z x1 = x[:, D:] 2025-05-07T20:33:16.8047243Z 2025-05-07T20:33:16.8047428Z if contiguous: 2025-05-07T20:33:16.8047658Z x0 = x0.contiguous() 2025-05-07T20:33:16.8047913Z x1 = x1.contiguous() 2025-05-07T20:33:16.8048272Z 2025-05-07T20:33:16.8048460Z if scale_ub is not None: 2025-05-07T20:33:16.8048729Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:16.8049056Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:16.8049358Z ) 2025-05-07T20:33:16.8049560Z else: 2025-05-07T20:33:16.8049777Z scale_ub_tensor = None 2025-05-07T20:33:16.8050016Z 2025-05-07T20:33:16.8050254Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:16.8050570Z op = silu_mul_quant 2025-05-07T20:33:16.8050827Z if compiled: 2025-05-07T20:33:16.8051070Z op = torch.compile(op) 2025-05-07T20:33:16.8051372Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.8051715Z 2025-05-07T20:33:16.8051902Z > y_fp8, y_scale = fn() 2025-05-07T20:33:16.8052074Z 2025-05-07T20:33:16.8052176Z moe/activation_test.py:117: 2025-05-07T20:33:16.8052470Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.8052789Z moe/activation_test.py:115: in fn 2025-05-07T20:33:16.8053064Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.8053613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:16.8054160Z return fn(*args, **kwargs) 
2025-05-07T20:33:16.8054807Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:16.8055479Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:16.8056007Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:16.8056673Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:16.8057331Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:16.8057858Z kernel = self.compile( 2025-05-07T20:33:16.8058411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:16.8059044Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:16.8059436Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.8059658Z 2025-05-07T20:33:16.8059864Z self = 2025-05-07T20:33:16.8060920Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:16.8062318Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d88d60c0>} 2025-05-07T20:33:16.8063636Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:16.8064638Z context = 2025-05-07T20:33:16.8064918Z 2025-05-07T20:33:16.8065088Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:16.8065598Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:16.8066066Z module_map=module_map) 2025-05-07T20:33:16.8066421Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:16.8066774Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:16.8067020Z E ^ 2025-05-07T20:33:16.8067520Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:16.8068023Z 2025-05-07T20:33:16.8068488Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:16.9411536Z 2025-05-07T20:33:16.9411919Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:16.9412380Z self=, 2025-05-07T20:33:16.9412780Z T=1, 2025-05-07T20:33:16.9412966Z D=7168, 2025-05-07T20:33:16.9413155Z scale_ub=None, 2025-05-07T20:33:16.9413362Z contiguous=False, 2025-05-07T20:33:16.9413627Z compiled=False, 2025-05-07T20:33:16.9413826Z ) 2025-05-07T20:33:16.9414142Z self = 2025-05-07T20:33:16.9414626Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:16.9415044Z 2025-05-07T20:33:16.9415120Z @given( 2025-05-07T20:33:16.9415350Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:16.9415663Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:16.9415960Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:16.9416295Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:16.9416625Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:16.9416910Z ) 2025-05-07T20:33:16.9417246Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:16.9417697Z def test_silu_mul_quant( 2025-05-07T20:33:16.9417940Z self, 2025-05-07T20:33:16.9418122Z T: int, 2025-05-07T20:33:16.9418312Z D: int, 2025-05-07T20:33:16.9418525Z scale_ub: Optional[float], 2025-05-07T20:33:16.9418781Z contiguous: bool, 2025-05-07T20:33:16.9419009Z compiled: bool, 2025-05-07T20:33:16.9419234Z ) -> None: 2025-05-07T20:33:16.9419434Z torch.manual_seed(2025) 2025-05-07T20:33:16.9419666Z 2025-05-07T20:33:16.9419935Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:16.9420274Z 2025-05-07T20:33:16.9420464Z x_sign = torch.sign(x) 2025-05-07T20:33:16.9420755Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:16.9421047Z x = x_sign * x_clamp 2025-05-07T20:33:16.9421290Z x0 = x[:, :D] 2025-05-07T20:33:16.9421515Z x1 = x[:, D:] 2025-05-07T20:33:16.9421728Z 2025-05-07T20:33:16.9421905Z if contiguous: 2025-05-07T20:33:16.9422149Z x0 = x0.contiguous() 2025-05-07T20:33:16.9422411Z x1 = x1.contiguous() 2025-05-07T20:33:16.9422634Z 2025-05-07T20:33:16.9422816Z if scale_ub is not None: 2025-05-07T20:33:16.9423093Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:16.9423418Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:16.9423728Z ) 2025-05-07T20:33:16.9423917Z else: 2025-05-07T20:33:16.9424192Z scale_ub_tensor = None 2025-05-07T20:33:16.9424440Z 2025-05-07T20:33:16.9424667Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:16.9424961Z op = silu_mul_quant 2025-05-07T20:33:16.9425201Z if compiled: 2025-05-07T20:33:16.9425438Z op = torch.compile(op) 2025-05-07T20:33:16.9425717Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.9425981Z 2025-05-07T20:33:16.9426164Z > y_fp8, y_scale = fn() 2025-05-07T20:33:16.9426322Z 2025-05-07T20:33:16.9426418Z moe/activation_test.py:117: 2025-05-07T20:33:16.9426698Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.9427032Z moe/activation_test.py:115: in fn 2025-05-07T20:33:16.9427308Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.9428057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:16.9439889Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:16.9440816Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:16.9441562Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:16.9442237Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:16.9442753Z kernel = self.compile( 2025-05-07T20:33:16.9443297Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:16.9443943Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:16.9444333Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.9444631Z 2025-05-07T20:33:16.9444833Z self = 2025-05-07T20:33:16.9445944Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:16.9447309Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d88d6c00>} 2025-05-07T20:33:16.9448621Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:16.9449616Z context = 2025-05-07T20:33:16.9449900Z 2025-05-07T20:33:16.9450061Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:16.9450579Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:16.9451041Z module_map=module_map) 2025-05-07T20:33:16.9451395Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:16.9451737Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:16.9451989Z E ^ 2025-05-07T20:33:16.9452435Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:16.9452883Z 2025-05-07T20:33:16.9453298Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:16.9453802Z 2025-05-07T20:33:16.9453899Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:16.9454302Z self=, 2025-05-07T20:33:16.9454689Z T=2048, 2025-05-07T20:33:16.9454876Z D=7168, 2025-05-07T20:33:16.9455059Z scale_ub=None, 2025-05-07T20:33:16.9455265Z contiguous=False, 2025-05-07T20:33:16.9455485Z compiled=True, 2025-05-07T20:33:16.9455744Z ) 2025-05-07T20:33:16.9456051Z self = 2025-05-07T20:33:16.9456536Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:16.9456807Z 2025-05-07T20:33:16.9456879Z @given( 2025-05-07T20:33:16.9457096Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:16.9457390Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:16.9457683Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:16.9458002Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:16.9458312Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:16.9458587Z ) 2025-05-07T20:33:16.9458925Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:16.9459357Z def test_silu_mul_quant( 2025-05-07T20:33:16.9459585Z self, 2025-05-07T20:33:16.9459777Z T: int, 2025-05-07T20:33:16.9459963Z D: int, 2025-05-07T20:33:16.9460172Z scale_ub: Optional[float], 2025-05-07T20:33:16.9461095Z contiguous: bool, 2025-05-07T20:33:16.9461327Z compiled: bool, 2025-05-07T20:33:16.9461541Z ) -> None: 2025-05-07T20:33:16.9461757Z torch.manual_seed(2025) 2025-05-07T20:33:16.9461993Z 2025-05-07T20:33:16.9462251Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:16.9462590Z 2025-05-07T20:33:16.9462774Z x_sign = torch.sign(x) 2025-05-07T20:33:16.9463050Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:16.9463355Z x = x_sign * x_clamp 2025-05-07T20:33:16.9463586Z x0 = x[:, :D] 2025-05-07T20:33:16.9463784Z x1 = x[:, D:] 2025-05-07T20:33:16.9463982Z 2025-05-07T20:33:16.9464157Z if contiguous: 2025-05-07T20:33:16.9464417Z x0 = x0.contiguous() 2025-05-07T20:33:16.9464663Z x1 = x1.contiguous() 2025-05-07T20:33:16.9464897Z 2025-05-07T20:33:16.9465076Z if scale_ub is not None: 2025-05-07T20:33:16.9465346Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:16.9465670Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:16.9465978Z ) 2025-05-07T20:33:16.9466157Z else: 2025-05-07T20:33:16.9466365Z scale_ub_tensor = None 2025-05-07T20:33:16.9466603Z 2025-05-07T20:33:16.9466816Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:16.9467117Z op = silu_mul_quant 2025-05-07T20:33:16.9467353Z if compiled: 2025-05-07T20:33:16.9467647Z op = torch.compile(op) 2025-05-07T20:33:16.9467930Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.9468191Z 2025-05-07T20:33:16.9468368Z > y_fp8, y_scale = fn() 2025-05-07T20:33:16.9468541Z 2025-05-07T20:33:16.9468634Z moe/activation_test.py:117: 2025-05-07T20:33:16.9468924Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.9469250Z moe/activation_test.py:115: in fn 2025-05-07T20:33:16.9469521Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:16.9470093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:16.9470638Z return fn(*args, **kwargs) 
2025-05-07T20:33:16.9471282Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:16.9471957Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:16.9472485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:16.9473149Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:16.9473798Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:16.9474314Z kernel = self.compile( 2025-05-07T20:33:16.9474918Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:16.9475609Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:16.9475997Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:16.9476222Z 2025-05-07T20:33:16.9476419Z self = 2025-05-07T20:33:16.9477519Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:16.9478855Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d93802c0>} 2025-05-07T20:33:16.9480211Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:16.9481245Z context = 2025-05-07T20:33:16.9481524Z 2025-05-07T20:33:16.9481693Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:16.9482209Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:16.9482660Z module_map=module_map) 2025-05-07T20:33:16.9483017Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:16.9483373Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:16.9483622Z E ^ 2025-05-07T20:33:16.9484075Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:16.9484556Z 2025-05-07T20:33:16.9484979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:16.9485490Z 2025-05-07T20:33:16.9485595Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:16.9485996Z self=, 2025-05-07T20:33:16.9486387Z T=4096, 2025-05-07T20:33:16.9486571Z D=7168, 2025-05-07T20:33:16.9486751Z scale_ub=None, 2025-05-07T20:33:16.9486963Z contiguous=False, 2025-05-07T20:33:16.9487184Z compiled=True, 2025-05-07T20:33:17.3597426Z ) 2025-05-07T20:33:17.3598127Z self = 2025-05-07T20:33:17.3598831Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:17.3599197Z 2025-05-07T20:33:17.3599306Z @given( 2025-05-07T20:33:17.3599657Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:17.3600111Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:17.3600439Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:17.3600767Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:17.3601093Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:17.3601373Z ) 2025-05-07T20:33:17.3601708Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:17.3602144Z def test_silu_mul_quant( 2025-05-07T20:33:17.3602391Z self, 2025-05-07T20:33:17.3602578Z T: int, 2025-05-07T20:33:17.3602772Z D: int, 2025-05-07T20:33:17.3602992Z scale_ub: Optional[float], 2025-05-07T20:33:17.3603251Z contiguous: bool, 2025-05-07T20:33:17.3603493Z compiled: bool, 2025-05-07T20:33:17.3603716Z ) -> None: 2025-05-07T20:33:17.3603918Z torch.manual_seed(2025) 2025-05-07T20:33:17.3604160Z 2025-05-07T20:33:17.3604430Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:17.3604772Z 2025-05-07T20:33:17.3605096Z x_sign = torch.sign(x) 2025-05-07T20:33:17.3605392Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:17.3605696Z x = x_sign * x_clamp 2025-05-07T20:33:17.3605927Z x0 = x[:, :D] 2025-05-07T20:33:17.3606138Z x1 = x[:, D:] 2025-05-07T20:33:17.3606342Z 2025-05-07T20:33:17.3606514Z if contiguous: 2025-05-07T20:33:17.3606737Z x0 = x0.contiguous() 2025-05-07T20:33:17.3606989Z x1 = x1.contiguous() 2025-05-07T20:33:17.3607216Z 2025-05-07T20:33:17.3607401Z if scale_ub is not None: 2025-05-07T20:33:17.3607672Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:17.3607995Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:17.3608297Z ) 2025-05-07T20:33:17.3608489Z else: 2025-05-07T20:33:17.3608696Z scale_ub_tensor = None 2025-05-07T20:33:17.3608951Z 2025-05-07T20:33:17.3609183Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:17.3609483Z op = silu_mul_quant 2025-05-07T20:33:17.3609910Z if compiled: 2025-05-07T20:33:17.3610147Z op = torch.compile(op) 2025-05-07T20:33:17.3610440Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:17.3610711Z 2025-05-07T20:33:17.3610888Z > y_fp8, y_scale = fn() 2025-05-07T20:33:17.3611052Z 2025-05-07T20:33:17.3611147Z moe/activation_test.py:117: 2025-05-07T20:33:17.3611433Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:17.3611750Z moe/activation_test.py:115: in fn 2025-05-07T20:33:17.3612013Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:17.3612560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:17.3613171Z return fn(*args, **kwargs) 
2025-05-07T20:33:17.3613817Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:17.3614489Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:17.3615037Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:17.3615699Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:17.3616346Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:17.3616871Z kernel = self.compile( 2025-05-07T20:33:17.3617402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:17.3618033Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:17.3618415Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:17.3618637Z 2025-05-07T20:33:17.3618842Z self = 2025-05-07T20:33:17.3619907Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:17.3621269Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d9380d60>} 2025-05-07T20:33:17.3622580Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:17.3623584Z context = 2025-05-07T20:33:17.3623865Z 2025-05-07T20:33:17.3624031Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:17.3624592Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:17.3625059Z module_map=module_map) 2025-05-07T20:33:17.3625424Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:17.3625761Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:17.3626014Z E ^ 2025-05-07T20:33:17.3626467Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:17.3626902Z 2025-05-07T20:33:17.3627320Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:17.3627939Z 2025-05-07T20:33:17.3628035Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:17.3628444Z self=, 2025-05-07T20:33:17.3628834Z T=16384, 2025-05-07T20:33:17.3629020Z D=5120, 2025-05-07T20:33:17.3629204Z scale_ub=1200.0, 2025-05-07T20:33:17.3629423Z contiguous=False, 2025-05-07T20:33:17.3629644Z compiled=False, 2025-05-07T20:33:17.3629886Z ) 2025-05-07T20:33:17.3630236Z self = 2025-05-07T20:33:17.3630727Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:17.3630997Z 2025-05-07T20:33:17.3631070Z @given( 2025-05-07T20:33:17.3631293Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:17.3631599Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:17.3631893Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:17.3632216Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:17.3632531Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:17.3632804Z ) 2025-05-07T20:33:17.3633134Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:17.3633626Z def test_silu_mul_quant( 2025-05-07T20:33:17.3633859Z self, 2025-05-07T20:33:17.3634040Z T: int, 2025-05-07T20:33:17.3634231Z D: int, 2025-05-07T20:33:17.3634441Z scale_ub: Optional[float], 2025-05-07T20:33:17.3634701Z contiguous: bool, 2025-05-07T20:33:17.3634928Z compiled: bool, 2025-05-07T20:33:17.3635144Z ) -> None: 2025-05-07T20:33:17.3635342Z torch.manual_seed(2025) 2025-05-07T20:33:17.3635582Z 2025-05-07T20:33:17.3635846Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:17.3636178Z 2025-05-07T20:33:17.3636369Z x_sign = torch.sign(x) 2025-05-07T20:33:17.3636651Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:17.3636945Z x = x_sign * x_clamp 2025-05-07T20:33:17.3637178Z x0 = x[:, :D] 2025-05-07T20:33:17.3637387Z x1 = x[:, D:] 2025-05-07T20:33:17.3637589Z 2025-05-07T20:33:17.3637763Z if contiguous: 2025-05-07T20:33:17.3637986Z x0 = x0.contiguous() 2025-05-07T20:33:17.3638246Z x1 = x1.contiguous() 2025-05-07T20:33:17.3638473Z 2025-05-07T20:33:17.3638669Z if scale_ub is not None: 2025-05-07T20:33:17.3638940Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:17.3639269Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:17.3639576Z ) 2025-05-07T20:33:17.3639765Z else: 2025-05-07T20:33:17.3639970Z scale_ub_tensor = None 2025-05-07T20:33:17.3640680Z 2025-05-07T20:33:17.3640926Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:17.3641269Z op = silu_mul_quant 2025-05-07T20:33:17.3641536Z if compiled: 2025-05-07T20:33:17.3641805Z op = torch.compile(op) 2025-05-07T20:33:17.3642121Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:17.3642383Z 2025-05-07T20:33:17.3642573Z > y_fp8, y_scale = fn() 2025-05-07T20:33:17.3642731Z 2025-05-07T20:33:17.3642833Z moe/activation_test.py:117: 2025-05-07T20:33:17.3643201Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:17.3643528Z moe/activation_test.py:115: in fn 2025-05-07T20:33:17.3643804Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:17.3644478Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:17.3645156Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:17.3645740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:17.3646408Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:17.3647056Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:17.3647571Z kernel = self.compile( 2025-05-07T20:33:17.3648104Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:17.3648800Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:17.3649238Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:17.3649461Z 2025-05-07T20:33:17.3649661Z self = 2025-05-07T20:33:17.3650720Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:17.3652058Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d9381c60>} 2025-05-07T20:33:17.3653434Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:17.3654436Z context = 2025-05-07T20:33:17.3654717Z 2025-05-07T20:33:17.3654883Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:17.3655425Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:17.3655908Z module_map=module_map) 2025-05-07T20:33:17.3656266Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:17.3656609Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:17.3656853Z E ^ 2025-05-07T20:33:17.3657302Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:17.3657743Z 2025-05-07T20:33:17.3658165Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:17.3658663Z 2025-05-07T20:33:17.3658771Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:17.3659169Z self=, 2025-05-07T20:33:17.3659565Z T=16384, 2025-05-07T20:33:17.3659752Z D=5120, 2025-05-07T20:33:17.3659928Z scale_ub=1200.0, 2025-05-07T20:33:17.3660148Z contiguous=True, 2025-05-07T20:33:17.3660369Z compiled=True, 2025-05-07T20:33:17.3660562Z ) 2025-05-07T20:33:17.3660876Z self = 2025-05-07T20:33:17.3661357Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:17.3661621Z 2025-05-07T20:33:17.3661703Z @given( 2025-05-07T20:33:17.3661918Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:17.3662221Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:17.3662529Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:17.3662906Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:17.3663226Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:17.3663506Z ) 2025-05-07T20:33:17.3663837Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:17.3664267Z def test_silu_mul_quant( 2025-05-07T20:33:17.3664498Z self, 2025-05-07T20:33:17.3664674Z T: int, 2025-05-07T20:33:17.3664871Z D: int, 2025-05-07T20:33:17.3665082Z scale_ub: Optional[float], 2025-05-07T20:33:17.3665334Z contiguous: bool, 2025-05-07T20:33:17.3665570Z compiled: bool, 2025-05-07T20:33:17.3665782Z ) -> None: 2025-05-07T20:33:17.3665985Z torch.manual_seed(2025) 2025-05-07T20:33:17.3666211Z 2025-05-07T20:33:17.3666467Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:17.3666800Z 2025-05-07T20:33:17.3666975Z x_sign = torch.sign(x) 2025-05-07T20:33:17.3667253Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:17.3667626Z x = x_sign * x_clamp 2025-05-07T20:33:17.3667969Z x0 = x[:, :D] 2025-05-07T20:33:17.3668175Z x1 = x[:, D:] 2025-05-07T20:33:17.3668373Z 2025-05-07T20:33:17.3668547Z if contiguous: 2025-05-07T20:33:17.3668772Z x0 = x0.contiguous() 2025-05-07T20:33:17.3669018Z x1 = x1.contiguous() 2025-05-07T20:33:17.3669244Z 2025-05-07T20:33:17.3669425Z if scale_ub is not None: 2025-05-07T20:33:17.3669691Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:17.3670010Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:17.3670308Z ) 2025-05-07T20:33:17.3670494Z else: 2025-05-07T20:33:17.3670699Z scale_ub_tensor = None 2025-05-07T20:33:17.3670935Z 2025-05-07T20:33:17.3671208Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:17.3671516Z op = silu_mul_quant 2025-05-07T20:33:17.3671749Z if compiled: 2025-05-07T20:33:17.3671991Z op = torch.compile(op) 2025-05-07T20:33:17.3672279Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:17.3672536Z 2025-05-07T20:33:17.3672715Z > y_fp8, y_scale = fn() 2025-05-07T20:33:17.3672871Z 2025-05-07T20:33:17.3672968Z moe/activation_test.py:117: 2025-05-07T20:33:17.3673241Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:17.3673555Z moe/activation_test.py:115: in fn 2025-05-07T20:33:17.3673820Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:17.3674364Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:17.3674898Z return fn(*args, **kwargs) 
2025-05-07T20:33:17.3675543Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:17.3676212Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:17.3676740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:17.3677400Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:17.3678048Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:17.3678563Z kernel = self.compile( 2025-05-07T20:33:17.3679093Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:17.3679737Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:17.3680130Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:17.3680349Z 2025-05-07T20:33:17.3680548Z self = 2025-05-07T20:33:17.3681664Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:17.3683067Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d9383380>} 2025-05-07T20:33:17.3684374Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:17.3685391Z context = 2025-05-07T20:33:17.3685702Z 2025-05-07T20:33:17.3685861Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:17.3686371Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:17.3686827Z module_map=module_map) 2025-05-07T20:33:17.3687181Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:17.3687604Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:17.3687851Z E ^ 2025-05-07T20:33:17.3688298Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:17.3688735Z 2025-05-07T20:33:17.3689149Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:17.5258916Z 2025-05-07T20:33:17.5259109Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:17.5259683Z self=, 2025-05-07T20:33:17.5260340Z T=16384, 2025-05-07T20:33:17.5260593Z D=5120, 2025-05-07T20:33:17.5268133Z scale_ub=None, 2025-05-07T20:33:17.5268431Z contiguous=False, 2025-05-07T20:33:17.5268738Z compiled=True, 2025-05-07T20:33:17.5268935Z ) 2025-05-07T20:33:17.5269252Z self = 2025-05-07T20:33:17.5269760Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:17.5270033Z 2025-05-07T20:33:17.5270118Z @given( 2025-05-07T20:33:17.5270344Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:17.5270657Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:17.5270959Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:17.5271279Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:17.5271614Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:17.5271896Z ) 2025-05-07T20:33:17.5272238Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:17.5272694Z def test_silu_mul_quant( 2025-05-07T20:33:17.5272941Z self, 2025-05-07T20:33:17.5273131Z T: int, 2025-05-07T20:33:17.5273327Z D: int, 2025-05-07T20:33:17.5273549Z scale_ub: Optional[float], 2025-05-07T20:33:17.5273813Z contiguous: bool, 2025-05-07T20:33:17.5274057Z compiled: bool, 2025-05-07T20:33:17.5274284Z ) -> None: 2025-05-07T20:33:17.5274499Z torch.manual_seed(2025) 2025-05-07T20:33:17.5274732Z 2025-05-07T20:33:17.5275002Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:17.5275349Z 2025-05-07T20:33:17.5275537Z x_sign = torch.sign(x) 2025-05-07T20:33:17.5275829Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:17.5276134Z x = x_sign * x_clamp 2025-05-07T20:33:17.5276373Z x0 = x[:, :D] 2025-05-07T20:33:17.5276594Z x1 = x[:, D:] 2025-05-07T20:33:17.5276800Z 2025-05-07T20:33:17.5276979Z if contiguous: 2025-05-07T20:33:17.5277210Z x0 = x0.contiguous() 2025-05-07T20:33:17.5277463Z x1 = x1.contiguous() 2025-05-07T20:33:17.5277696Z 2025-05-07T20:33:17.5277886Z if scale_ub is not None: 2025-05-07T20:33:17.5278273Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:17.5278604Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:17.5278911Z ) 2025-05-07T20:33:17.5279100Z else: 2025-05-07T20:33:17.5279315Z scale_ub_tensor = None 2025-05-07T20:33:17.5279563Z 2025-05-07T20:33:17.5279794Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:17.5280106Z op = silu_mul_quant 2025-05-07T20:33:17.5280346Z if compiled: 2025-05-07T20:33:17.5280587Z op = torch.compile(op) 2025-05-07T20:33:17.5280879Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:17.5281143Z 2025-05-07T20:33:17.5281341Z > y_fp8, y_scale = fn() 2025-05-07T20:33:17.5281507Z 2025-05-07T20:33:17.5281607Z moe/activation_test.py:117: 2025-05-07T20:33:17.5281896Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:17.5282227Z moe/activation_test.py:115: in fn 2025-05-07T20:33:17.5282508Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:17.5283194Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:17.5283746Z return fn(*args, **kwargs) 
2025-05-07T20:33:17.5284399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:17.5285072Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:17.5285596Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:17.5286270Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:17.5286933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:17.5287506Z kernel = self.compile( 2025-05-07T20:33:17.5288054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:17.5288710Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:17.5289100Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:17.5289324Z 2025-05-07T20:33:17.5289531Z self = 2025-05-07T20:33:17.5290596Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:17.5291953Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d83485e0>} 2025-05-07T20:33:17.5293276Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:17.5294293Z context = 2025-05-07T20:33:17.5294579Z 2025-05-07T20:33:17.5294742Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:17.5295267Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:17.5295729Z module_map=module_map) 2025-05-07T20:33:17.5296094Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:17.5296441Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:17.5296701Z E ^ 2025-05-07T20:33:17.5297165Z E ValueError("type fp8e4nv not supported in this architecture. 
Ten further examples fail identically, each repeating the same test-body listing and the same Triton trace as above and ending in

    E   triton.compiler.errors.CompilationError: at 1:0:
    E   def _fbgemm_silu_mul_quant(
    E   ^
    E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

    /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Only the sampled parameters differ (for compiled=False the torch/_dynamo/eval_frame.py frame is absent from the trace; everything else is identical):

2025-05-07T20:33:17 Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=True)  -> CompilationError
2025-05-07T20:33:17 Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError
2025-05-07T20:33:17 Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)  -> CompilationError
2025-05-07T20:33:17 Trying example: test_silu_mul_quant(T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError
2025-05-07T20:33:17 Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)  -> CompilationError
2025-05-07T20:33:17 Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False) -> CompilationError
2025-05-07T20:33:17 Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError
2025-05-07T20:33:18 Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError
2025-05-07T20:33:18 Trying example: test_silu_mul_quant(T=128,   D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError
2025-05-07T20:33:18 Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=True,  compiled=True)  -> CompilationError
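Because the ValueError is raised while src.make_ir parses the kernel into Triton IR, before any launch or any data is touched, the outcome is independent of T, D, scale_ub, contiguous, and compiled. A standalone repro sketch under that assumption, using the call signature shown in the test and the import path from the trace; the shapes are the smallest sampled values:

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    D = 5120
    x = torch.randn([1, 2 * D], device="cuda", dtype=torch.bfloat16)
    # On this runner (SM 8.6) the Triton JIT raises
    # triton.compiler.errors.CompilationError wrapping the fp8e4nv ValueError.
    y_fp8, y_scale = silu_mul_quant(x[:, :D], x[:, D:], None)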
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:18.3030646Z 2025-05-07T20:33:18.3031059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:18.3031559Z 2025-05-07T20:33:18.3031667Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.3032144Z self=, 2025-05-07T20:33:18.3032528Z T=16384, 2025-05-07T20:33:18.3032715Z D=5120, 2025-05-07T20:33:18.3032905Z scale_ub=None, 2025-05-07T20:33:18.3033107Z contiguous=False, 2025-05-07T20:33:18.3033331Z compiled=False, 2025-05-07T20:33:18.3033531Z ) 2025-05-07T20:33:18.3033842Z self = 2025-05-07T20:33:18.3034328Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:18.3034598Z 2025-05-07T20:33:18.3034678Z @given( 2025-05-07T20:33:18.3034898Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.3035271Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.3035566Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.3035886Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.3036212Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.3036485Z ) 2025-05-07T20:33:18.3036822Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.3037259Z def test_silu_mul_quant( 2025-05-07T20:33:18.3037489Z self, 2025-05-07T20:33:18.3037675Z T: int, 2025-05-07T20:33:18.3037860Z D: int, 2025-05-07T20:33:18.3038080Z scale_ub: Optional[float], 2025-05-07T20:33:18.3038342Z contiguous: bool, 2025-05-07T20:33:18.3038569Z compiled: bool, 2025-05-07T20:33:18.3038780Z ) -> None: 2025-05-07T20:33:18.3038995Z torch.manual_seed(2025) 2025-05-07T20:33:18.3039225Z 2025-05-07T20:33:18.3039477Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.3039804Z 2025-05-07T20:33:18.3039989Z x_sign = torch.sign(x) 2025-05-07T20:33:18.3040488Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:18.3042479Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
2025-05-07T20:33:18.3044777Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:18.3055403Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB (28.44 MiB free of 22.07 GiB; 21.61 GiB allocated by PyTorch)
2025-05-07T20:33:18.3057480Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:33:18.3057792Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:33:18.3067662Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB (140.44 MiB free of 22.07 GiB; 21.50 GiB allocated by PyTorch)
2025-05-07T20:33:18.3069605Z moe/activation_test.py:92: OutOfMemoryError
2025-05-07T20:33:18.4319437Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:18.4330233Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (28.44 MiB free of 22.07 GiB; 21.67 GiB allocated by PyTorch)
2025-05-07T20:33:18.4332292Z moe/activation_test.py:95: OutOfMemoryError
2025-05-07T20:33:18.4332593Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:18.4343159Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (28.44 MiB free of 22.07 GiB; 21.67 GiB allocated by PyTorch)
2025-05-07T20:33:18.4345183Z moe/activation_test.py:94: OutOfMemoryError
2025-05-07T20:33:18.4345394Z
2025-05-07T20:33:18.4345498Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:18.4345894Z     self=,
2025-05-07T20:33:18.4346304Z     T=1,
2025-05-07T20:33:18.4346475Z     D=7168,
2025-05-07T20:33:18.4346652Z     scale_ub=1200.0,
2025-05-07T20:33:18.4346862Z     contiguous=True,
2025-05-07T20:33:18.4347071Z     compiled=False,
2025-05-07T20:33:18.4347258Z )
2025-05-07T20:33:18.4347617Z self =
2025-05-07T20:33:18.4348092Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False
2025-05-07T20:33:18.4348345Z
2025-05-07T20:33:18.4348420Z     @given(
2025-05-07T20:33:18.4348634Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:18.4348928Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:18.4349220Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:18.4349534Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:18.4349851Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:18.4350117Z     )
2025-05-07T20:33:18.4350450Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:18.4350893Z     def test_silu_mul_quant(
2025-05-07T20:33:18.4351138Z         self,
2025-05-07T20:33:18.4351325Z         T: int,
2025-05-07T20:33:18.4351521Z         D: int,
2025-05-07T20:33:18.4351743Z         scale_ub: Optional[float],
2025-05-07T20:33:18.4352004Z         contiguous: bool,
2025-05-07T20:33:18.4352247Z         compiled: bool,
2025-05-07T20:33:18.4352462Z     ) -> None:
2025-05-07T20:33:18.4352739Z         torch.manual_seed(2025)
2025-05-07T20:33:18.4352974Z
2025-05-07T20:33:18.4353237Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:18.4353558Z
2025-05-07T20:33:18.4353733Z         x_sign = torch.sign(x)
2025-05-07T20:33:18.4354010Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:18.4354302Z         x = x_sign * x_clamp
2025-05-07T20:33:18.4354522Z         x0 = x[:, :D]
2025-05-07T20:33:18.4354729Z         x1 = x[:, D:]
2025-05-07T20:33:18.4354923Z
2025-05-07T20:33:18.4355093Z         if contiguous:
2025-05-07T20:33:18.4355317Z             x0 = x0.contiguous()
2025-05-07T20:33:18.4355558Z             x1 = x1.contiguous()
2025-05-07T20:33:18.4355816Z
2025-05-07T20:33:18.4356016Z         if scale_ub is not None:
2025-05-07T20:33:18.4356275Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:18.4356603Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:18.4356900Z             )
2025-05-07T20:33:18.4357080Z         else:
2025-05-07T20:33:18.4357277Z             scale_ub_tensor = None
2025-05-07T20:33:18.4357637Z
2025-05-07T20:33:18.4357858Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:18.4358166Z             op = silu_mul_quant
2025-05-07T20:33:18.4365589Z             if compiled:
2025-05-07T20:33:18.4365850Z                 op = torch.compile(op)
2025-05-07T20:33:18.4366142Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:18.4366407Z
2025-05-07T20:33:18.4366593Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:18.4366753Z
2025-05-07T20:33:18.4366847Z moe/activation_test.py:117:
2025-05-07T20:33:18.4367135Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:18.4367459Z moe/activation_test.py:115: in fn
2025-05-07T20:33:18.4367805Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:18.4368504Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:18.4369185Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:18.4380163Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:18.4380507Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:18.4380751Z E       ^
2025-05-07T20:33:18.4381196Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:18.4381632Z
2025-05-07T20:33:18.4382049Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:18.4382640Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:18.4409888Z E       triton.compiler.errors.CompilationError: at 1:0 in _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:18.4411789Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
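Hypothesis prints each "Trying example" block because the test runs with verbosity=Verbosity.verbose; once a failing parameter set is known from a log like this one, it can be pinned with an @example decorator so reruns exercise it deterministically before the sampled grid. A minimal sketch of that pattern (a standalone toy test, not the FBGEMM test itself):

    from typing import Optional

    from hypothesis import example, given, settings, strategies as st

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
    )
    @example(T=1, D=7168, scale_ub=1200.00)  # failing case from the log, pinned
    @settings(deadline=None, max_examples=16)
    def test_parameter_grid(T: int, D: int, scale_ub: Optional[float]) -> None:
        # Stand-in assertion; the real test would call the kernel under test.
        assert T >= 1 and D in (5120, 7168)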
2025-05-07T20:33:18.5545709Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:18.5573424Z E       triton.compiler.errors.CompilationError: at 1:0 in _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:18.5575486Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:18.5576150Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:33:18.5586007Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB (26.44 MiB free of 22.07 GiB; 21.69 GiB allocated by PyTorch)
2025-05-07T20:33:18.5588042Z moe/activation_test.py:92: OutOfMemoryError
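The "Tried to allocate" sizes line up exactly with the test's input tensor: a [T, 2*D] bfloat16 tensor occupies T * 2D * 2 bytes. A quick check of that arithmetic against the sizes in this log:

    def input_mib(T: int, D: int) -> float:
        # x = torch.randn([T, 2 * D], dtype=torch.bfloat16): 2 bytes/element.
        return T * 2 * D * 2 / 2**20

    assert input_mib(16384, 7168) == 448.0  # the 448.00 MiB requests
    assert input_mib(16384, 5120) == 320.0  # the 320.00 MiB requests
    assert input_mib(4096, 7168) == 112.0
    assert input_mib(2048, 5120) == 40.0    # fails once only ~26 MiB is free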
2025-05-07T20:33:18.5588360Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:33:18.5615867Z E       triton.compiler.errors.CompilationError: at 1:0 in _fbgemm_silu_mul_quant: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:18.5617856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:18.6450090Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:18.6460573Z E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB (26.44 MiB free of 22.07 GiB; 21.73 GiB allocated by PyTorch)
2025-05-07T20:33:18.6462610Z moe/activation_test.py:94: OutOfMemoryError
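For context on what the kernel under test computes: silu_mul_quant is FBGEMM's fused MoE activation, and the test shapes suggest the usual SwiGLU-style pattern of silu(x0) * x1 followed by FP8 quantization with an optional scale upper bound. A sketch of an eager-mode reference under that assumption (the rowwise-scale details here are illustrative, not FBGEMM's exact algorithm):

    from typing import Optional, Tuple

    import torch
    import torch.nn.functional as F

    def silu_mul_quant_reference(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Assumed semantics: fused silu(x0) * x1, then rowwise fp8 scaling.
        y = F.silu(x0.float()) * x1.float()
        amax = y.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        if scale_ub is not None:
            amax = torch.minimum(amax, scale_ub)  # cap the rowwise amax
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        scale = amax / fp8_max
        y_fp8 = (y / scale).to(torch.float8_e4m3fn)
        return y_fp8, scale.squeeze(-1)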
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:18.6462495Z 2025-05-07T20:33:18.6462610Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:18.6462826Z 2025-05-07T20:33:18.6462926Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.6463336Z self=, 2025-05-07T20:33:18.6463735Z T=16384, 2025-05-07T20:33:18.6463987Z D=5120, 2025-05-07T20:33:18.6464218Z scale_ub=None, 2025-05-07T20:33:18.6464425Z contiguous=True, 2025-05-07T20:33:18.6464635Z compiled=False, 2025-05-07T20:33:18.6464829Z ) 2025-05-07T20:33:18.6465137Z self = 2025-05-07T20:33:18.6465613Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:18.6465923Z 2025-05-07T20:33:18.6466011Z @given( 2025-05-07T20:33:18.6466246Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.6466548Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.6466846Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.6467169Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.6467639Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.6467911Z ) 2025-05-07T20:33:18.6468259Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.6468707Z def test_silu_mul_quant( 2025-05-07T20:33:18.6468935Z self, 2025-05-07T20:33:18.6469124Z T: int, 2025-05-07T20:33:18.6469318Z D: int, 2025-05-07T20:33:18.6469523Z scale_ub: Optional[float], 2025-05-07T20:33:18.6469791Z contiguous: bool, 2025-05-07T20:33:18.6470030Z compiled: bool, 2025-05-07T20:33:18.6470244Z ) -> None: 2025-05-07T20:33:18.6470458Z torch.manual_seed(2025) 2025-05-07T20:33:18.6477638Z 2025-05-07T20:33:18.6477938Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.6479994Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:18.6481868Z 2025-05-07T20:33:18.6481987Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:18.6482204Z 2025-05-07T20:33:18.6482304Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.6482706Z self=, 2025-05-07T20:33:18.6483107Z T=4096, 2025-05-07T20:33:18.6483293Z D=5120, 2025-05-07T20:33:18.6483487Z scale_ub=None, 2025-05-07T20:33:18.6483694Z contiguous=True, 2025-05-07T20:33:18.6483917Z compiled=False, 2025-05-07T20:33:18.6484123Z ) 2025-05-07T20:33:18.6484437Z self = 2025-05-07T20:33:18.6484917Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:18.6485250Z 2025-05-07T20:33:18.6485338Z @given( 2025-05-07T20:33:18.6485565Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.6485867Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.6486165Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.6486484Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.6486801Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.6487080Z ) 2025-05-07T20:33:18.6487421Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.6487860Z def test_silu_mul_quant( 2025-05-07T20:33:18.6488101Z self, 2025-05-07T20:33:18.6488289Z T: int, 2025-05-07T20:33:18.6488476Z D: int, 2025-05-07T20:33:18.6488691Z scale_ub: Optional[float], 2025-05-07T20:33:18.6488965Z contiguous: bool, 2025-05-07T20:33:18.6489193Z compiled: bool, 2025-05-07T20:33:18.6489412Z ) -> None: 2025-05-07T20:33:18.6489622Z torch.manual_seed(2025) 2025-05-07T20:33:18.6489904Z 2025-05-07T20:33:18.6490203Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.6492201Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:18.6494100Z 2025-05-07T20:33:18.6494253Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:18.6494456Z 2025-05-07T20:33:18.6494556Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.6494958Z self=, 2025-05-07T20:33:18.6495362Z T=2048, 2025-05-07T20:33:18.6495548Z D=5120, 2025-05-07T20:33:18.6495728Z scale_ub=None, 2025-05-07T20:33:18.6495957Z contiguous=False, 2025-05-07T20:33:18.6496198Z compiled=False, 2025-05-07T20:33:18.6496393Z ) 2025-05-07T20:33:18.6496693Z self = 2025-05-07T20:33:18.6497167Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:18.6497431Z 2025-05-07T20:33:18.6497512Z @given( 2025-05-07T20:33:18.6497727Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.6498029Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.6498324Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.6498643Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.6498966Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.6499246Z ) 2025-05-07T20:33:18.6499598Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.6500038Z def test_silu_mul_quant( 2025-05-07T20:33:18.6500273Z self, 2025-05-07T20:33:18.6500453Z T: int, 2025-05-07T20:33:18.6500642Z D: int, 2025-05-07T20:33:18.6500851Z scale_ub: Optional[float], 2025-05-07T20:33:18.6501110Z contiguous: bool, 2025-05-07T20:33:18.6501340Z compiled: bool, 2025-05-07T20:33:18.6501553Z ) -> None: 2025-05-07T20:33:18.6501764Z torch.manual_seed(2025) 2025-05-07T20:33:18.6501996Z 2025-05-07T20:33:18.6502251Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.6504312Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:18.6506186Z 2025-05-07T20:33:18.6506301Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:18.6506505Z 2025-05-07T20:33:18.6506606Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.6507001Z self=, 2025-05-07T20:33:18.6507390Z T=4096, 2025-05-07T20:33:18.6507629Z D=7168, 2025-05-07T20:33:18.6507808Z scale_ub=None, 2025-05-07T20:33:18.6508014Z contiguous=True, 2025-05-07T20:33:18.6508231Z compiled=True, 2025-05-07T20:33:18.6508422Z ) 2025-05-07T20:33:18.6508731Z self = 2025-05-07T20:33:18.6509207Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:18.6509552Z 2025-05-07T20:33:18.6509633Z @given( 2025-05-07T20:33:18.6509850Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.6510154Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.6510451Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.6510765Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.6511082Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.6511364Z ) 2025-05-07T20:33:18.6511708Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.6512141Z def test_silu_mul_quant( 2025-05-07T20:33:18.6512379Z self, 2025-05-07T20:33:18.6512562Z T: int, 2025-05-07T20:33:18.6512797Z D: int, 2025-05-07T20:33:18.6513009Z scale_ub: Optional[float], 2025-05-07T20:33:18.6513266Z contiguous: bool, 2025-05-07T20:33:18.6513504Z compiled: bool, 2025-05-07T20:33:18.6513724Z ) -> None: 2025-05-07T20:33:18.6513934Z torch.manual_seed(2025) 2025-05-07T20:33:18.6514170Z 2025-05-07T20:33:18.6514431Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.6516446Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:18.6518274Z 2025-05-07T20:33:18.6518394Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:18.6518600Z 2025-05-07T20:33:18.6518703Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.6519110Z self=, 2025-05-07T20:33:18.6519511Z T=2048, 2025-05-07T20:33:18.6519693Z D=5120, 2025-05-07T20:33:18.6519871Z scale_ub=1200.0, 2025-05-07T20:33:18.6520089Z contiguous=False, 2025-05-07T20:33:18.6520309Z compiled=False, 2025-05-07T20:33:18.7066230Z ) 2025-05-07T20:33:18.7066903Z self = 2025-05-07T20:33:18.7067719Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:18.7068093Z 2025-05-07T20:33:18.7068200Z @given( 2025-05-07T20:33:18.7068504Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.7068912Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.7069303Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.7069653Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.7070093Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.7070375Z ) 2025-05-07T20:33:18.7070722Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.7071178Z def test_silu_mul_quant( 2025-05-07T20:33:18.7071422Z self, 2025-05-07T20:33:18.7071605Z T: int, 2025-05-07T20:33:18.7071795Z D: int, 2025-05-07T20:33:18.7072013Z scale_ub: Optional[float], 2025-05-07T20:33:18.7072306Z contiguous: bool, 2025-05-07T20:33:18.7072537Z compiled: bool, 2025-05-07T20:33:18.7072747Z ) -> None: 2025-05-07T20:33:18.7072948Z torch.manual_seed(2025) 2025-05-07T20:33:18.7073182Z 2025-05-07T20:33:18.7073444Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.7075527Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:18.7077458Z 2025-05-07T20:33:18.7077574Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:18.7077781Z 2025-05-07T20:33:18.7077887Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.7078294Z self=, 2025-05-07T20:33:18.7078678Z T=4096, 2025-05-07T20:33:18.7078856Z D=7168, 2025-05-07T20:33:18.7079037Z scale_ub=1200.0, 2025-05-07T20:33:18.7079307Z contiguous=True, 2025-05-07T20:33:18.7079516Z compiled=False, 2025-05-07T20:33:18.7079713Z ) 2025-05-07T20:33:18.7080023Z self = 2025-05-07T20:33:18.7080512Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:18.7080777Z 2025-05-07T20:33:18.7080853Z @given( 2025-05-07T20:33:18.7081065Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.7081361Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.7081662Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.7081979Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.7082288Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.7082559Z ) 2025-05-07T20:33:18.7082893Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.7083333Z def test_silu_mul_quant( 2025-05-07T20:33:18.7083570Z self, 2025-05-07T20:33:18.7083751Z T: int, 2025-05-07T20:33:18.7083933Z D: int, 2025-05-07T20:33:18.7084140Z scale_ub: Optional[float], 2025-05-07T20:33:18.7084401Z contiguous: bool, 2025-05-07T20:33:18.7084629Z compiled: bool, 2025-05-07T20:33:18.7084838Z ) -> None: 2025-05-07T20:33:18.7085042Z torch.manual_seed(2025) 2025-05-07T20:33:18.7085259Z 2025-05-07T20:33:18.7085513Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.7087523Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:18.7089385Z 2025-05-07T20:33:18.7089547Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:18.7089756Z 2025-05-07T20:33:18.7089859Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.7090250Z self=, 2025-05-07T20:33:18.7090647Z T=16384, 2025-05-07T20:33:18.7090830Z D=7168, 2025-05-07T20:33:18.7091004Z scale_ub=None, 2025-05-07T20:33:18.7091212Z contiguous=False, 2025-05-07T20:33:18.7091425Z compiled=True, 2025-05-07T20:33:18.7091615Z ) 2025-05-07T20:33:18.7091918Z self = 2025-05-07T20:33:18.7092394Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:18.7092672Z 2025-05-07T20:33:18.7092746Z @given( 2025-05-07T20:33:18.7092961Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.7093273Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.7093562Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.7093873Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.7094265Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.7094531Z ) 2025-05-07T20:33:18.7094861Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.7095297Z def test_silu_mul_quant( 2025-05-07T20:33:18.7095524Z self, 2025-05-07T20:33:18.7095700Z T: int, 2025-05-07T20:33:18.7095882Z D: int, 2025-05-07T20:33:18.7096092Z scale_ub: Optional[float], 2025-05-07T20:33:18.7096355Z contiguous: bool, 2025-05-07T20:33:18.7096574Z compiled: bool, 2025-05-07T20:33:18.7096779Z ) -> None: 2025-05-07T20:33:18.7096982Z torch.manual_seed(2025) 2025-05-07T20:33:18.7097212Z 2025-05-07T20:33:18.7097518Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.7099531Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:18.7101383Z 2025-05-07T20:33:18.7101503Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:18.7101710Z 2025-05-07T20:33:18.7101807Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.7102208Z self=, 2025-05-07T20:33:18.7102608Z T=4096, 2025-05-07T20:33:18.7102795Z D=7168, 2025-05-07T20:33:18.7102974Z scale_ub=None, 2025-05-07T20:33:18.7103178Z contiguous=True, 2025-05-07T20:33:18.7103390Z compiled=False, 2025-05-07T20:33:18.7103584Z ) 2025-05-07T20:33:18.7103893Z self = 2025-05-07T20:33:18.7104372Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:18.7104631Z 2025-05-07T20:33:18.7104701Z @given( 2025-05-07T20:33:18.7104919Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.7105224Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.7105512Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.7105829Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.7106146Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.7106421Z ) 2025-05-07T20:33:18.7106755Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.7107201Z def test_silu_mul_quant( 2025-05-07T20:33:18.7107512Z self, 2025-05-07T20:33:18.7107693Z T: int, 2025-05-07T20:33:18.7107927Z D: int, 2025-05-07T20:33:18.7108136Z scale_ub: Optional[float], 2025-05-07T20:33:18.7108394Z contiguous: bool, 2025-05-07T20:33:18.7108622Z compiled: bool, 2025-05-07T20:33:18.7108836Z ) -> None: 2025-05-07T20:33:18.7109032Z torch.manual_seed(2025) 2025-05-07T20:33:18.7109267Z 2025-05-07T20:33:18.7109530Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.7111528Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:18.7113343Z 2025-05-07T20:33:18.7113566Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:18.7113769Z 2025-05-07T20:33:18.7113865Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.7114264Z self=, 2025-05-07T20:33:18.7114659Z T=16384, 2025-05-07T20:33:18.7114837Z D=7168, 2025-05-07T20:33:18.7115022Z scale_ub=None, 2025-05-07T20:33:18.7115223Z contiguous=True, 2025-05-07T20:33:18.7115428Z compiled=False, 2025-05-07T20:33:18.7115626Z ) 2025-05-07T20:33:18.7115931Z self = 2025-05-07T20:33:18.7116405Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:18.7116678Z 2025-05-07T20:33:18.7116848Z @given( 2025-05-07T20:33:18.7117068Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.7117375Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.7117666Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.7117990Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.7118303Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.7118575Z ) 2025-05-07T20:33:18.7118913Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.7119353Z def test_silu_mul_quant( 2025-05-07T20:33:18.7119581Z self, 2025-05-07T20:33:18.7119770Z T: int, 2025-05-07T20:33:18.7119958Z D: int, 2025-05-07T20:33:18.7120163Z scale_ub: Optional[float], 2025-05-07T20:33:18.7120422Z contiguous: bool, 2025-05-07T20:33:18.7120651Z compiled: bool, 2025-05-07T20:33:18.7120853Z ) -> None: 2025-05-07T20:33:18.7121048Z torch.manual_seed(2025) 2025-05-07T20:33:18.7121276Z 2025-05-07T20:33:18.7121534Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.7123523Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:18.7125438Z 2025-05-07T20:33:18.7125547Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:18.7125755Z 2025-05-07T20:33:18.7125852Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.7126249Z self=, 2025-05-07T20:33:18.7126639Z T=16384, 2025-05-07T20:33:18.7126814Z D=7168, 2025-05-07T20:33:18.7126998Z scale_ub=1200.0, 2025-05-07T20:33:18.7127257Z contiguous=True, 2025-05-07T20:33:18.7127461Z compiled=False, 2025-05-07T20:33:18.7127649Z ) 2025-05-07T20:33:18.7127945Z self = 2025-05-07T20:33:18.7128413Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:18.7128684Z 2025-05-07T20:33:18.7128754Z @given( 2025-05-07T20:33:18.7128962Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.7129255Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.7129540Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.7129855Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.7130160Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.7130421Z ) 2025-05-07T20:33:18.7130753Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.7131191Z def test_silu_mul_quant( 2025-05-07T20:33:18.7131415Z self, 2025-05-07T20:33:18.7131603Z T: int, 2025-05-07T20:33:18.7131875Z D: int, 2025-05-07T20:33:18.7132080Z scale_ub: Optional[float], 2025-05-07T20:33:18.7132341Z contiguous: bool, 2025-05-07T20:33:18.7132568Z compiled: bool, 2025-05-07T20:33:18.7132772Z ) -> None: 2025-05-07T20:33:18.7132977Z torch.manual_seed(2025) 2025-05-07T20:33:18.7133207Z 2025-05-07T20:33:18.7133457Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.7135465Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
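Annotation: every OOM message here ends with the allocator's own hint, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. A minimal sketch of applying it, assuming it can still take effect: the variable is read when the CUDA caching allocator initializes, so exporting it in the job environment before Python starts is the reliable route, and an in-process assignment only works before the first CUDA allocation.

    # Hedged sketch: prefer `export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`
    # in the CI job environment. In-process it must run before any CUDA tensor exists.
    import os
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
    import torch
    x = torch.zeros(1, device="cuda")  # first allocation; the allocator config is now fixed

This only mitigates fragmentation (the "19.12 MiB is reserved by PyTorch but unallocated" portion above); it cannot help once 21.73 GiB of live allocations genuinely fill the 22.07 GiB card.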
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:18.7137439Z 2025-05-07T20:33:18.7137551Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:18.8985965Z 2025-05-07T20:33:18.8986125Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.8986664Z self=, 2025-05-07T20:33:18.8987057Z T=128, 2025-05-07T20:33:18.8987302Z D=5120, 2025-05-07T20:33:18.8987620Z scale_ub=1200.0, 2025-05-07T20:33:18.8987944Z contiguous=False, 2025-05-07T20:33:18.8988244Z compiled=False, 2025-05-07T20:33:18.8988461Z ) 2025-05-07T20:33:18.8988776Z self = 2025-05-07T20:33:18.8989261Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:18.8989535Z 2025-05-07T20:33:18.8989612Z @given( 2025-05-07T20:33:18.8989839Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.8990147Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.8990445Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.8990766Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.8991090Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.8991364Z ) 2025-05-07T20:33:18.8991706Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.8992140Z def test_silu_mul_quant( 2025-05-07T20:33:18.8992374Z self, 2025-05-07T20:33:18.8992560Z T: int, 2025-05-07T20:33:18.8992748Z D: int, 2025-05-07T20:33:18.8992954Z scale_ub: Optional[float], 2025-05-07T20:33:18.8993213Z contiguous: bool, 2025-05-07T20:33:18.8993439Z compiled: bool, 2025-05-07T20:33:18.8993654Z ) -> None: 2025-05-07T20:33:18.8993851Z torch.manual_seed(2025) 2025-05-07T20:33:18.8994081Z 2025-05-07T20:33:18.8994454Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.8994785Z 2025-05-07T20:33:18.8994969Z x_sign = torch.sign(x) 2025-05-07T20:33:18.8995247Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:18.8995540Z x = x_sign * x_clamp 2025-05-07T20:33:18.8995775Z x0 = x[:, :D] 2025-05-07T20:33:18.8996019Z x1 = x[:, D:] 2025-05-07T20:33:18.8996231Z 2025-05-07T20:33:18.8996409Z if contiguous: 2025-05-07T20:33:18.8996631Z x0 = x0.contiguous() 2025-05-07T20:33:18.8996881Z x1 = x1.contiguous() 2025-05-07T20:33:18.8997113Z 2025-05-07T20:33:18.8997299Z if scale_ub is not None: 2025-05-07T20:33:18.8997559Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:18.8997882Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:18.8998186Z ) 2025-05-07T20:33:18.8998372Z else: 2025-05-07T20:33:18.8998567Z scale_ub_tensor = None 2025-05-07T20:33:18.8998809Z 2025-05-07T20:33:18.8999099Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:18.8999451Z op = silu_mul_quant 2025-05-07T20:33:18.8999691Z if compiled: 2025-05-07T20:33:18.8999931Z op = torch.compile(op) 2025-05-07T20:33:18.9000210Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:18.9000478Z 2025-05-07T20:33:18.9000658Z > y_fp8, y_scale = fn() 2025-05-07T20:33:18.9000817Z 2025-05-07T20:33:18.9000911Z moe/activation_test.py:117: 2025-05-07T20:33:18.9001192Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.9001506Z moe/activation_test.py:115: in fn 2025-05-07T20:33:18.9001772Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:18.9002444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:18.9003185Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:18.9003754Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:18.9004427Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:18.9005082Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:18.9005598Z kernel = self.compile( 2025-05-07T20:33:18.9006140Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:18.9006781Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:18.9007171Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.9007397Z 2025-05-07T20:33:18.9007599Z self = 2025-05-07T20:33:18.9008675Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:18.9010022Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f359bbf11c0>} 2025-05-07T20:33:18.9011326Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:18.9012320Z context = 2025-05-07T20:33:18.9012599Z 2025-05-07T20:33:18.9012760Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:18.9013269Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:18.9013780Z module_map=module_map) 2025-05-07T20:33:18.9014139Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:18.9014484Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:18.9014731Z E ^ 2025-05-07T20:33:18.9015183Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:18.9015624Z 2025-05-07T20:33:18.9016038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:18.9016541Z 2025-05-07T20:33:18.9016639Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.9017047Z self=, 2025-05-07T20:33:18.9017433Z T=2048, 2025-05-07T20:33:18.9017609Z D=7168, 2025-05-07T20:33:18.9017798Z scale_ub=None, 2025-05-07T20:33:18.9018004Z contiguous=False, 2025-05-07T20:33:18.9018225Z compiled=False, 2025-05-07T20:33:18.9018430Z ) 2025-05-07T20:33:18.9018754Z self = 2025-05-07T20:33:18.9019313Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:18.9027210Z 2025-05-07T20:33:18.9027305Z @given( 2025-05-07T20:33:18.9027603Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.9027903Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.9028202Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.9028523Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.9028838Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.9029121Z ) 2025-05-07T20:33:18.9029464Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.9029976Z def test_silu_mul_quant( 2025-05-07T20:33:18.9030205Z self, 2025-05-07T20:33:18.9030396Z T: int, 2025-05-07T20:33:18.9030588Z D: int, 2025-05-07T20:33:18.9030798Z scale_ub: Optional[float], 2025-05-07T20:33:18.9031066Z contiguous: bool, 2025-05-07T20:33:18.9031297Z compiled: bool, 2025-05-07T20:33:18.9031506Z ) -> None: 2025-05-07T20:33:18.9031714Z torch.manual_seed(2025) 2025-05-07T20:33:18.9031951Z 2025-05-07T20:33:18.9032212Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.9034226Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
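Annotation: the CompilationError above is a hardware mismatch, not a memory issue. Triton's fp8e4nv is the FP8 E4M3 format, and on the NVIDIA backend its codegen is only available on newer architectures; the (8, 9) compute-capability cutoff below is an assumption (Ada/Hopper class) inferred from the error text, which says this GPU only offers fp8e4b15 and fp8e5. A minimal sketch of a capability gate that would skip such cases instead of erroring; supports_fp8_e4m3 is an illustrative helper, not an fbgemm_gpu API:

    # Hedged sketch: skip FP8 E4M3 cases on GPUs where Triton cannot compile fp8e4nv.
    import unittest
    import torch

    def supports_fp8_e4m3() -> bool:
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)  # assumed threshold

    @unittest.skipIf(not supports_fp8_e4m3(), "Triton fp8e4nv unsupported on this GPU")
    class Fp8ActivationTests(unittest.TestCase):
        ...

With a gate like this the run would report skips on this runner rather than four distinct Hypothesis failures mixing OOM and compile errors.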
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:18.9036183Z 2025-05-07T20:33:18.9036299Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:18.9036514Z 2025-05-07T20:33:18.9036613Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:18.9037017Z self=, 2025-05-07T20:33:18.9037414Z T=128, 2025-05-07T20:33:18.9037596Z D=7168, 2025-05-07T20:33:18.9037780Z scale_ub=1200.0, 2025-05-07T20:33:18.9037987Z contiguous=True, 2025-05-07T20:33:18.9038198Z compiled=True, 2025-05-07T20:33:18.9038394Z ) 2025-05-07T20:33:18.9038693Z self = 2025-05-07T20:33:18.9039164Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:18.9039432Z 2025-05-07T20:33:18.9039507Z @given( 2025-05-07T20:33:18.9039729Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:18.9040032Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:18.9040659Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:18.9040994Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:18.9041306Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:18.9041581Z ) 2025-05-07T20:33:18.9041916Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:18.9042353Z def test_silu_mul_quant( 2025-05-07T20:33:18.9042577Z self, 2025-05-07T20:33:18.9042768Z T: int, 2025-05-07T20:33:18.9042953Z D: int, 2025-05-07T20:33:18.9043161Z scale_ub: Optional[float], 2025-05-07T20:33:18.9043425Z contiguous: bool, 2025-05-07T20:33:18.9043659Z compiled: bool, 2025-05-07T20:33:18.9043880Z ) -> None: 2025-05-07T20:33:18.9044090Z torch.manual_seed(2025) 2025-05-07T20:33:18.9044326Z 2025-05-07T20:33:18.9044585Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:18.9044920Z 2025-05-07T20:33:18.9045108Z x_sign = torch.sign(x) 2025-05-07T20:33:18.9045389Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:18.9045813Z x = x_sign * x_clamp 2025-05-07T20:33:18.9046048Z x0 = x[:, :D] 2025-05-07T20:33:18.9046249Z x1 = x[:, D:] 2025-05-07T20:33:18.9046447Z 2025-05-07T20:33:18.9046621Z if contiguous: 2025-05-07T20:33:18.9046839Z x0 = x0.contiguous() 2025-05-07T20:33:18.9047084Z x1 = x1.contiguous() 2025-05-07T20:33:18.9047315Z 2025-05-07T20:33:18.9047492Z if scale_ub is not None: 2025-05-07T20:33:18.9047758Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:18.9048079Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:18.9048371Z ) 2025-05-07T20:33:18.9048558Z else: 2025-05-07T20:33:18.9048828Z scale_ub_tensor = None 2025-05-07T20:33:18.9049064Z 2025-05-07T20:33:18.9049282Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:18.9049586Z op = silu_mul_quant 2025-05-07T20:33:18.9049837Z if compiled: 2025-05-07T20:33:18.9050069Z op = torch.compile(op) 2025-05-07T20:33:18.9050355Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:18.9050619Z 2025-05-07T20:33:18.9050800Z > y_fp8, y_scale = fn() 2025-05-07T20:33:18.9050961Z 2025-05-07T20:33:18.9051058Z moe/activation_test.py:117: 2025-05-07T20:33:18.9051339Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.9051652Z moe/activation_test.py:115: in fn 2025-05-07T20:33:18.9051922Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:18.9052492Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:18.9053045Z return fn(*args, **kwargs) 
2025-05-07T20:33:18.9053695Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:18.9054368Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:18.9054895Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:18.9055553Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:18.9056259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:18.9056785Z kernel = self.compile( 2025-05-07T20:33:18.9057334Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:18.9057969Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:18.9058358Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:18.9058585Z 2025-05-07T20:33:18.9058793Z self = 2025-05-07T20:33:18.9059898Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:18.9061239Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f359b85fb00>} 2025-05-07T20:33:18.9062553Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:18.9063554Z context = 2025-05-07T20:33:18.9063835Z 2025-05-07T20:33:18.9064003Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:18.9064514Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:18.9065019Z module_map=module_map) 2025-05-07T20:33:18.9065414Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:18.9065767Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:18.9066010Z E ^ 2025-05-07T20:33:18.9066600Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:18.9067044Z 2025-05-07T20:33:18.9067544Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.1876882Z 2025-05-07T20:33:19.1877157Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.1877576Z self=, 2025-05-07T20:33:19.1878094Z T=128, 2025-05-07T20:33:19.1878293Z D=7168, 2025-05-07T20:33:19.1878483Z scale_ub=1200.0, 2025-05-07T20:33:19.1878703Z contiguous=True, 2025-05-07T20:33:19.1878950Z compiled=False, 2025-05-07T20:33:19.1879159Z ) 2025-05-07T20:33:19.1879480Z self = 2025-05-07T20:33:19.1879970Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.1880236Z 2025-05-07T20:33:19.1880321Z @given( 2025-05-07T20:33:19.1880547Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.1880860Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.1881165Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.1881490Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.1881806Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.1882083Z ) 2025-05-07T20:33:19.1882425Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.1882856Z def test_silu_mul_quant( 2025-05-07T20:33:19.1883092Z self, 2025-05-07T20:33:19.1883289Z T: int, 2025-05-07T20:33:19.1883474Z D: int, 2025-05-07T20:33:19.1883692Z scale_ub: Optional[float], 2025-05-07T20:33:19.1883957Z contiguous: bool, 2025-05-07T20:33:19.1884189Z compiled: bool, 2025-05-07T20:33:19.1884412Z ) -> None: 2025-05-07T20:33:19.1884619Z torch.manual_seed(2025) 2025-05-07T20:33:19.1884849Z 2025-05-07T20:33:19.1885114Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.1885444Z 2025-05-07T20:33:19.1885632Z x_sign = torch.sign(x) 2025-05-07T20:33:19.1885915Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.1887950Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.1889875Z 2025-05-07T20:33:19.1889990Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:19.1890200Z 2025-05-07T20:33:19.1890307Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.1890701Z self=, 2025-05-07T20:33:19.1891104Z T=128, 2025-05-07T20:33:19.1891285Z D=5120, 2025-05-07T20:33:19.1891470Z scale_ub=1200.0, 2025-05-07T20:33:19.1891683Z contiguous=True, 2025-05-07T20:33:19.1891895Z compiled=True, 2025-05-07T20:33:19.1892089Z ) 2025-05-07T20:33:19.1892396Z self = 2025-05-07T20:33:19.1892876Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:19.1893143Z 2025-05-07T20:33:19.1893232Z @given( 2025-05-07T20:33:19.1893573Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.1893880Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.1894186Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.1894506Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.1894831Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.1895110Z ) 2025-05-07T20:33:19.1895457Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.1895900Z def test_silu_mul_quant( 2025-05-07T20:33:19.1896138Z self, 2025-05-07T20:33:19.1896331Z T: int, 2025-05-07T20:33:19.1896514Z D: int, 2025-05-07T20:33:19.1896724Z scale_ub: Optional[float], 2025-05-07T20:33:19.1897046Z contiguous: bool, 2025-05-07T20:33:19.1897273Z compiled: bool, 2025-05-07T20:33:19.1897489Z ) -> None: 2025-05-07T20:33:19.1897701Z torch.manual_seed(2025) 2025-05-07T20:33:19.1897944Z 2025-05-07T20:33:19.1898202Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.1898535Z 2025-05-07T20:33:19.1898720Z x_sign = torch.sign(x) 2025-05-07T20:33:19.1899002Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.1900949Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
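Annotation: this example and the one just before it fail a line later than the others, at x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0). torch.abs(x) materializes a full copy of x and torch.clamp allocates another, so the line transiently needs two extra x-sized buffers on a card with only 4.44 MiB free. A hedged sketch (not the test's actual code) of trimming those temporaries with in-place ops while keeping the same values:

    import torch
    # Shapes from this example: T=128, D=5120
    x = torch.randn(128, 2 * 5120, device="cuda", dtype=torch.bfloat16)
    x_sign = torch.sign(x)
    x_clamp = x.abs().clamp_(0.01, 2.0)  # clamp_ reuses the buffer abs() just allocated
    x = x_sign.mul_(x_clamp)             # in-place multiply; the old x is freed by refcount

Peak extra memory drops from two x-sized temporaries to one, which matters only at the margin here but is the usual first step when a test dies on elementwise intermediates.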
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.1902793Z 2025-05-07T20:33:19.1902908Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:19.1903119Z 2025-05-07T20:33:19.1903218Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.1903633Z self=, 2025-05-07T20:33:19.1904030Z T=128, 2025-05-07T20:33:19.1904227Z D=7168, 2025-05-07T20:33:19.1904498Z scale_ub=None, 2025-05-07T20:33:19.1904782Z contiguous=True, 2025-05-07T20:33:19.1905100Z compiled=True, 2025-05-07T20:33:19.1905399Z ) 2025-05-07T20:33:19.1905882Z self = 2025-05-07T20:33:19.1906438Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:19.1906809Z 2025-05-07T20:33:19.1906892Z @given( 2025-05-07T20:33:19.1907190Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.1907701Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.1908133Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.1908670Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.1909149Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.1909568Z ) 2025-05-07T20:33:19.1910064Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.1910657Z def test_silu_mul_quant( 2025-05-07T20:33:19.1910984Z self, 2025-05-07T20:33:19.1911251Z T: int, 2025-05-07T20:33:19.1911516Z D: int, 2025-05-07T20:33:19.1911800Z scale_ub: Optional[float], 2025-05-07T20:33:19.1912158Z contiguous: bool, 2025-05-07T20:33:19.1912476Z compiled: bool, 2025-05-07T20:33:19.1912769Z ) -> None: 2025-05-07T20:33:19.1913054Z torch.manual_seed(2025) 2025-05-07T20:33:19.1913371Z 2025-05-07T20:33:19.1913720Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.1916585Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
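Annotation: note how the free-memory figure shrinks across the run: the first failures report 26.44 MiB free, these later ones only 4.44 MiB, so allocations are accumulating across Hypothesis examples inside the single test invocation. A minimal cleanup helper, as an illustrative mitigation only; Hypothesis re-enters the test body for every example, so calling this at the top of the body is the simplest hook:

    # Hedged sketch: drop dead references, then hand cached blocks back to the driver.
    import gc
    import torch

    def release_cuda_memory() -> None:
        gc.collect()               # free Python-side references to old example tensors
        torch.cuda.empty_cache()   # return the caching allocator's unused blocks
        torch.cuda.synchronize()   # make sure pending frees have completed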
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.1919190Z 2025-05-07T20:33:19.1919355Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.1919648Z 2025-05-07T20:33:19.1920150Z FAILED 2025-05-07T20:33:19.1920300Z 2025-05-07T20:33:19.1920475Z =================================== FAILURES =================================== 2025-05-07T20:33:19.1921044Z _____________________ ActivationTests.test_silu_mul_quant ______________________ 2025-05-07T20:33:19.1921678Z + Exception Group Traceback (most recent call last): 2025-05-07T20:33:19.1922508Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 58, in testPartExecutor 2025-05-07T20:33:19.1923229Z | yield 2025-05-07T20:33:19.1923842Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 651, in run 2025-05-07T20:33:19.1924542Z | self._callTestMethod(testMethod) 2025-05-07T20:33:19.1924938Z | ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^ 2025-05-07T20:33:19.1925651Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 606, in _callTestMethod 2025-05-07T20:33:19.1926394Z | if method() is not None: 2025-05-07T20:33:19.1926755Z | ~~~~~~^^ 2025-05-07T20:33:19.1927604Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant 2025-05-07T20:33:19.1928602Z | T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.1928985Z | ^^^^^^^ 2025-05-07T20:33:19.1929753Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/hypothesis/core.py", line 1850, in wrapped_test 2025-05-07T20:33:19.1930593Z | raise the_error_hypothesis_found 2025-05-07T20:33:19.1931165Z | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions) 2025-05-07T20:33:19.1931727Z +-+---------------- 1 ---------------- 2025-05-07T20:33:19.1932122Z | Traceback (most recent call last): 2025-05-07T20:33:19.1933089Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:19.1934140Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.1937033Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
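Annotation: the "+ Exception Group Traceback" framing that opens the FAILURES section is PEP 654 output. Hypothesis 6.x collects the distinct falsifying examples (four here) into one ExceptionGroup, and the Python 3.13 traceback machinery renders the sub-exceptions with the +-+ prefixes seen above and below. A minimal sketch of handling such a group programmatically with except* (Python 3.11+); run_suite is a hypothetical stand-in for whatever raises the group:

    # Hedged sketch of PEP 654 handling; not part of the test suite itself.
    import torch

    def run_suite() -> None:
        raise ExceptionGroup("demo", [torch.OutOfMemoryError("oom"), ValueError("fp8")])

    try:
        run_suite()
    except* torch.OutOfMemoryError as eg:
        print(f"{len(eg.exceptions)} OOM failure(s)")        # mirrors sub-exceptions 1-3
    except* ValueError as eg:
        print(f"{len(eg.exceptions)} compile-side failure(s)")  # mirrors sub-exception 4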
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.1939692Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:19.1940540Z | self=, 2025-05-07T20:33:19.1941081Z | T=2048, 2025-05-07T20:33:19.1941402Z | D=5120, # or any other generated value 2025-05-07T20:33:19.1941859Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:19.1942333Z | contiguous=True, # or any other generated value 2025-05-07T20:33:19.1942778Z | compiled=False, # or any other generated value 2025-05-07T20:33:19.1943092Z | ) 2025-05-07T20:33:19.1943269Z | 2025-05-07T20:33:19.1943893Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case 2025-05-07T20:33:19.1944544Z +---------------- 2 ---------------- 2025-05-07T20:33:19.1944836Z | Traceback (most recent call last): 2025-05-07T20:33:19.1945535Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:19.1946295Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.1948373Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.1950380Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:19.1950815Z | self=, 2025-05-07T20:33:19.1951212Z | T=128, 2025-05-07T20:33:19.1951415Z | D=7168, 2025-05-07T20:33:19.1951624Z | scale_ub=None, 2025-05-07T20:33:19.1951859Z | contiguous=True, 2025-05-07T20:33:19.1952102Z | compiled=True, 2025-05-07T20:33:19.1952329Z | ) 2025-05-07T20:33:19.1952505Z | 2025-05-07T20:33:19.1953022Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:19.1953622Z +---------------- 3 ---------------- 2025-05-07T20:33:19.1953907Z | Traceback (most recent call last): 2025-05-07T20:33:19.1954616Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant 2025-05-07T20:33:19.1955390Z | x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.1957843Z | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
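Annotation: each falsifying example above comes with a replay recipe, e.g. @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') for failure 1. A sketch of where that decorator goes, mirroring the @given/@settings stack from the log; the blob only decodes against this exact strategy signature and Hypothesis version, the max_examples value stands in for _MAX_SAMPLES, and the body is a stub for the real one in moe/activation_test.py:

    import unittest
    from typing import Optional
    from hypothesis import Verbosity, given, reproduce_failure, settings
    import hypothesis.strategies as st

    class ReplayActivationTests(unittest.TestCase):
        @reproduce_failure("6.131.14", b"AEECQQBBAEEAQQE=")  # blob copied from failure 1
        @given(
            T=st.sampled_from([1, 128, 2048, 4096, 16384]),
            D=st.sampled_from([5120, 7168]),
            scale_ub=st.sampled_from([None, 1200.00]),
            contiguous=st.sampled_from([True, False]),
            compiled=st.sampled_from([True, False]),
        )
        @settings(verbosity=Verbosity.verbose, max_examples=10, deadline=None)
        def test_silu_mul_quant(
            self, T: int, D: int, scale_ub: Optional[float],
            contiguous: bool, compiled: bool,
        ) -> None:
            ...  # stub; with a passing body Hypothesis reports the example no longer fails

The decorator is meant to be temporary: once the underlying bug is fixed, Hypothesis complains that the example did not reproduce, reminding you to delete it.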
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.1960500Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:19.1961103Z | self=, 2025-05-07T20:33:19.1961795Z | T=128, 2025-05-07T20:33:19.1962088Z | D=5120, 2025-05-07T20:33:19.1962386Z | scale_ub=1200.0, 2025-05-07T20:33:19.1962725Z | contiguous=True, 2025-05-07T20:33:19.1963061Z | compiled=True, 2025-05-07T20:33:19.1963368Z | ) 2025-05-07T20:33:19.1963618Z | 2025-05-07T20:33:19.1969630Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case 2025-05-07T20:33:19.1970512Z +---------------- 4 ---------------- 2025-05-07T20:33:19.1970907Z | Traceback (most recent call last): 2025-05-07T20:33:19.1971899Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant 2025-05-07T20:33:19.1972883Z | y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:19.1973268Z | ~~~~~~^^ 2025-05-07T20:33:19.1974250Z | File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn 2025-05-07T20:33:19.1975299Z | return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:19.1976457Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row 2025-05-07T20:33:19.1977530Z | _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:19.1977922Z | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^ 2025-05-07T20:33:19.1978283Z | a, 2025-05-07T20:33:19.1978548Z | ^^ 2025-05-07T20:33:19.1978835Z | ...<23 lines>... 
2025-05-07T20:33:19.1979166Z | USE_INT64=use_int64, 2025-05-07T20:33:19.1979518Z | ^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:19.1979925Z | ) 2025-05-07T20:33:19.1980178Z | ^ 2025-05-07T20:33:19.1980906Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 330, in 2025-05-07T20:33:19.1981931Z | return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.1982558Z | ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:19.1983426Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 186, in run 2025-05-07T20:33:19.1984498Z | timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:19.1985138Z | ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:19.1986023Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 166, in _bench 2025-05-07T20:33:19.1986977Z | return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:19.1987612Z | ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 2025-05-07T20:33:19.1988450Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py", line 117, in do_bench 2025-05-07T20:33:19.1989224Z | fn() 2025-05-07T20:33:19.1989490Z | ~~^^ 2025-05-07T20:33:19.1990259Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call 2025-05-07T20:33:19.1991140Z | self.fn.run( 2025-05-07T20:33:19.1991441Z | ~~~~~~~~~~~^ 2025-05-07T20:33:19.1991725Z | *args, 2025-05-07T20:33:19.1992024Z | ^^^^^^ 2025-05-07T20:33:19.1992316Z | **current, 2025-05-07T20:33:19.1992616Z | ^^^^^^^^^^ 2025-05-07T20:33:19.1992914Z | ) 2025-05-07T20:33:19.1993170Z | ^ 2025-05-07T20:33:19.1993837Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 623, in run 2025-05-07T20:33:19.1994685Z | kernel = self.compile( 2025-05-07T20:33:19.1995041Z | src, 2025-05-07T20:33:19.1995322Z | target=target, 2025-05-07T20:33:19.2016251Z | options=options.__dict__, 2025-05-07T20:33:19.2016633Z | ) 2025-05-07T20:33:19.2017383Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 273, in compile 2025-05-07T20:33:19.2018342Z | module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2019290Z | File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir 2025-05-07T20:33:19.2020330Z | return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2020957Z | module_map=module_map) 2025-05-07T20:33:19.2021435Z | triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2021891Z | def _kernel_quantize_fp8_row( 2025-05-07T20:33:19.2022245Z | ^ 2025-05-07T20:33:19.2023050Z | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2023809Z | Falsifying example: test_silu_mul_quant( 2025-05-07T20:33:19.2024334Z | # The test always failed when commented parts were varied together. 
2025-05-07T20:33:19.2025022Z | self=, 2025-05-07T20:33:19.2025591Z | T=1, # or any other generated value 2025-05-07T20:33:19.2026022Z | D=5120, # or any other generated value 2025-05-07T20:33:19.2026502Z | scale_ub=None, # or any other generated value 2025-05-07T20:33:19.2026984Z | contiguous=True, # or any other generated value 2025-05-07T20:33:19.2027664Z | compiled=True, # or any other generated value 2025-05-07T20:33:19.2028055Z | ) 2025-05-07T20:33:19.2028293Z | 2025-05-07T20:33:19.2029009Z | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case 2025-05-07T20:33:19.2029823Z +------------------------------------ 2025-05-07T20:33:19.2030300Z ---------------------------------- Hypothesis ---------------------------------- 2025-05-07T20:33:19.2030804Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2031336Z self=, 2025-05-07T20:33:19.2031867Z T=1, 2025-05-07T20:33:19.2032109Z D=5120, 2025-05-07T20:33:19.2032360Z scale_ub=None, 2025-05-07T20:33:19.2032646Z contiguous=True, 2025-05-07T20:33:19.2032946Z compiled=True, 2025-05-07T20:33:19.2033225Z ) 2025-05-07T20:33:19.2033641Z self = 2025-05-07T20:33:19.2034283Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:19.2034636Z 2025-05-07T20:33:19.2034752Z @given( 2025-05-07T20:33:19.2035054Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2035472Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2035873Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2036313Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2036752Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2037132Z ) 2025-05-07T20:33:19.2037593Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2038177Z def test_silu_mul_quant( 2025-05-07T20:33:19.2038489Z self, 2025-05-07T20:33:19.2038740Z T: int, 2025-05-07T20:33:19.2038986Z D: int, 2025-05-07T20:33:19.2039261Z scale_ub: Optional[float], 2025-05-07T20:33:19.2039612Z contiguous: bool, 2025-05-07T20:33:19.2039916Z compiled: bool, 2025-05-07T20:33:19.2040526Z ) -> None: 2025-05-07T20:33:19.2040817Z torch.manual_seed(2025) 2025-05-07T20:33:19.2041304Z 2025-05-07T20:33:19.2041675Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2042142Z 2025-05-07T20:33:19.2042393Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2042774Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2043207Z x = x_sign * x_clamp 2025-05-07T20:33:19.2043525Z x0 = x[:, :D] 2025-05-07T20:33:19.2043816Z x1 = x[:, D:] 2025-05-07T20:33:19.2044104Z 2025-05-07T20:33:19.2044354Z if contiguous: 2025-05-07T20:33:19.2044666Z x0 = x0.contiguous() 2025-05-07T20:33:19.2045016Z x1 = x1.contiguous() 2025-05-07T20:33:19.2045331Z 2025-05-07T20:33:19.2045568Z if scale_ub is not None: 2025-05-07T20:33:19.2045932Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2046387Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2046787Z ) 2025-05-07T20:33:19.2047059Z else: 2025-05-07T20:33:19.2047352Z scale_ub_tensor = None 2025-05-07T20:33:19.2047777Z 2025-05-07T20:33:19.2048160Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2048574Z op = silu_mul_quant 2025-05-07T20:33:19.2048904Z if compiled: 2025-05-07T20:33:19.2049233Z op = torch.compile(op) 2025-05-07T20:33:19.2049627Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2049983Z 2025-05-07T20:33:19.2050246Z 
y_fp8, y_scale = fn() 2025-05-07T20:33:19.2050624Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:19.2051008Z 2025-05-07T20:33:19.2051324Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2051763Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:19.2052238Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:19.2052646Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:19.2053126Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:19.2053540Z 2025-05-07T20:33:19.2053793Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:19.2054056Z 2025-05-07T20:33:19.2054182Z moe/activation_test.py:126: 2025-05-07T20:33:19.2054569Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2054989Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:19.2055408Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:19.2056485Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:19.2057479Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:19.2058176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2059062Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2059962Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:19.2060905Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:19.2061851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:19.2062693Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:19.2063523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:19.2064203Z fn() 2025-05-07T20:33:19.2064883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:19.2065660Z self.fn.run( 2025-05-07T20:33:19.2066283Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2067044Z kernel = self.compile( 2025-05-07T20:33:19.2067862Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2068708Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2069210Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2069511Z 2025-05-07T20:33:19.2069771Z self = 2025-05-07T20:33:19.2071174Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2072998Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . 
at 0x7f38ffeae700>} 2025-05-07T20:33:19.2074808Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2076160Z context = 2025-05-07T20:33:19.2076537Z 2025-05-07T20:33:19.2076752Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2077431Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2078036Z module_map=module_map) 2025-05-07T20:33:19.2078489Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2078945Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:19.2079293Z E ^ 2025-05-07T20:33:19.2079942Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2080537Z 2025-05-07T20:33:19.2081085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2081768Z 2025-05-07T20:33:19.2081900Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2082434Z self=, 2025-05-07T20:33:19.2082941Z T=2048, 2025-05-07T20:33:19.2083181Z D=5120, 2025-05-07T20:33:19.2083431Z scale_ub=1200.0, 2025-05-07T20:33:19.2083735Z contiguous=True, 2025-05-07T20:33:19.2084048Z compiled=False, 2025-05-07T20:33:19.2084317Z ) 2025-05-07T20:33:19.2084739Z self = 2025-05-07T20:33:19.2085391Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.2085762Z 2025-05-07T20:33:19.2085865Z @given( 2025-05-07T20:33:19.2086152Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2086543Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2086951Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2087411Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2087830Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2088202Z ) 2025-05-07T20:33:19.2088657Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2089227Z def test_silu_mul_quant( 2025-05-07T20:33:19.2089543Z self, 2025-05-07T20:33:19.2089813Z T: int, 2025-05-07T20:33:19.2090082Z D: int, 2025-05-07T20:33:19.2090378Z scale_ub: Optional[float], 2025-05-07T20:33:19.2090726Z contiguous: bool, 2025-05-07T20:33:19.2091045Z compiled: bool, 2025-05-07T20:33:19.2091331Z ) -> None: 2025-05-07T20:33:19.2091615Z torch.manual_seed(2025) 2025-05-07T20:33:19.2091946Z 2025-05-07T20:33:19.2092287Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2092749Z 2025-05-07T20:33:19.2093073Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2093461Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2093886Z x = x_sign * x_clamp 2025-05-07T20:33:19.2094218Z x0 = x[:, :D] 2025-05-07T20:33:19.2094509Z x1 = x[:, D:] 2025-05-07T20:33:19.2094799Z 2025-05-07T20:33:19.2095057Z if contiguous: 2025-05-07T20:33:19.2095367Z x0 = x0.contiguous() 2025-05-07T20:33:19.2095725Z x1 = x1.contiguous() 2025-05-07T20:33:19.2096072Z 2025-05-07T20:33:19.2096343Z if scale_ub is not None: 2025-05-07T20:33:19.2096698Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2097143Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2097554Z ) 2025-05-07T20:33:19.2097818Z else: 2025-05-07T20:33:19.2098088Z scale_ub_tensor = None 2025-05-07T20:33:19.2098409Z 2025-05-07T20:33:19.2098715Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2099145Z op = silu_mul_quant 2025-05-07T20:33:19.2099591Z if compiled: 
2025-05-07T20:33:19.2099913Z op = torch.compile(op) 2025-05-07T20:33:19.2100318Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2100676Z 2025-05-07T20:33:19.2100924Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2101154Z 2025-05-07T20:33:19.2101288Z moe/activation_test.py:117: 2025-05-07T20:33:19.2101697Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2102130Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2102503Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2103444Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.2104489Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2105242Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2106149Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2107009Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2107797Z kernel = self.compile( 2025-05-07T20:33:19.2108534Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2109442Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2109961Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2110259Z 2025-05-07T20:33:19.2110526Z self = 2025-05-07T20:33:19.2111997Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2113865Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38ffd5e020>} 2025-05-07T20:33:19.2115663Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2117044Z context = 2025-05-07T20:33:19.2117421Z 2025-05-07T20:33:19.2117650Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2118378Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2119033Z module_map=module_map) 2025-05-07T20:33:19.2119597Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2120086Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2120433Z E ^ 2025-05-07T20:33:19.2121044Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2121689Z 2025-05-07T20:33:19.2122273Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2122997Z 2025-05-07T20:33:19.2123134Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2123677Z self=, 2025-05-07T20:33:19.2124204Z T=2048, 2025-05-07T20:33:19.2124440Z D=5120, 2025-05-07T20:33:19.2124687Z scale_ub=1200.0, 2025-05-07T20:33:19.2124975Z contiguous=True, 2025-05-07T20:33:19.2125249Z compiled=True, 2025-05-07T20:33:19.2125511Z ) 2025-05-07T20:33:19.2125948Z self = 2025-05-07T20:33:19.2126660Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:19.2127062Z 2025-05-07T20:33:19.2127163Z @given( 2025-05-07T20:33:19.2127455Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2127856Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2128272Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2128715Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2129166Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2129538Z ) 2025-05-07T20:33:19.2129997Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2130597Z def test_silu_mul_quant( 2025-05-07T20:33:19.2130923Z self, 2025-05-07T20:33:19.2131235Z T: int, 2025-05-07T20:33:19.2131500Z D: int, 2025-05-07T20:33:19.2131776Z scale_ub: Optional[float], 2025-05-07T20:33:19.2132125Z contiguous: bool, 2025-05-07T20:33:19.2132429Z compiled: bool, 2025-05-07T20:33:19.2132712Z ) -> None: 2025-05-07T20:33:19.2132988Z torch.manual_seed(2025) 2025-05-07T20:33:19.2133299Z 2025-05-07T20:33:19.2133636Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2134070Z 2025-05-07T20:33:19.2134314Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2134674Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2135069Z x = x_sign * x_clamp 2025-05-07T20:33:19.2135378Z x0 = x[:, :D] 2025-05-07T20:33:19.2135646Z x1 = x[:, D:] 2025-05-07T20:33:19.2135912Z 2025-05-07T20:33:19.2136150Z if contiguous: 2025-05-07T20:33:19.2136446Z x0 = x0.contiguous() 2025-05-07T20:33:19.2136770Z x1 = x1.contiguous() 2025-05-07T20:33:19.2137082Z 2025-05-07T20:33:19.2137332Z if scale_ub is not None: 2025-05-07T20:33:19.2137683Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2138120Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2138525Z ) 2025-05-07T20:33:19.2138775Z else: 2025-05-07T20:33:19.2139048Z scale_ub_tensor = None 2025-05-07T20:33:19.2139377Z 2025-05-07T20:33:19.2139666Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2140349Z op = silu_mul_quant 2025-05-07T20:33:19.2140686Z if compiled: 2025-05-07T20:33:19.2140998Z op = torch.compile(op) 2025-05-07T20:33:19.2141392Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2141759Z 2025-05-07T20:33:19.2142005Z y_fp8, y_scale = fn() 2025-05-07T20:33:19.2142380Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:19.2142763Z 2025-05-07T20:33:19.2143073Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2143503Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:19.2144038Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:19.2144463Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:19.2144927Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:19.2145337Z 2025-05-07T20:33:19.2145603Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:19.2145867Z 2025-05-07T20:33:19.2145995Z moe/activation_test.py:126: 2025-05-07T20:33:19.2146386Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2146830Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:19.2147253Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:19.2148381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:19.2149395Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:19.2150112Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2151177Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2152109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:19.2153079Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:19.2154078Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:19.2154928Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:19.2155740Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:19.2156521Z fn() 2025-05-07T20:33:19.2157205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:19.2157975Z self.fn.run( 2025-05-07T20:33:19.2158605Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2159324Z kernel = self.compile( 2025-05-07T20:33:19.2160042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2160932Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2161488Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2161796Z 2025-05-07T20:33:19.2162082Z self = 2025-05-07T20:33:19.2163517Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2165383Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38fec3e200>} 2025-05-07T20:33:19.2167264Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2168646Z context = 2025-05-07T20:33:19.2169042Z 2025-05-07T20:33:19.2169273Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2169973Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2170620Z module_map=module_map) 2025-05-07T20:33:19.2171120Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2171593Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:19.2171949Z E ^ 2025-05-07T20:33:19.2172630Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2173272Z 2025-05-07T20:33:19.2173849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2174540Z 2025-05-07T20:33:19.2174680Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2175229Z self=, 2025-05-07T20:33:19.2175772Z T=16384, 2025-05-07T20:33:19.2176024Z D=7168, 2025-05-07T20:33:19.2176286Z scale_ub=1200.0, 2025-05-07T20:33:19.2176594Z contiguous=False, 2025-05-07T20:33:19.2176893Z compiled=False, 2025-05-07T20:33:19.2177185Z ) 2025-05-07T20:33:19.2177614Z self = 2025-05-07T20:33:19.2178297Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:19.2178674Z 2025-05-07T20:33:19.2178780Z @given( 2025-05-07T20:33:19.2179195Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2179621Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2180017Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2180480Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2180926Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2181296Z ) 2025-05-07T20:33:19.2181756Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2182332Z def test_silu_mul_quant( 2025-05-07T20:33:19.2182639Z self, 2025-05-07T20:33:19.2182902Z T: int, 2025-05-07T20:33:19.2183172Z D: int, 2025-05-07T20:33:19.2183469Z scale_ub: Optional[float], 2025-05-07T20:33:19.2183891Z contiguous: bool, 2025-05-07T20:33:19.2184211Z compiled: bool, 2025-05-07T20:33:19.2184499Z ) -> None: 2025-05-07T20:33:19.2184778Z torch.manual_seed(2025) 2025-05-07T20:33:19.2185103Z 2025-05-07T20:33:19.2185461Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2185916Z 2025-05-07T20:33:19.2186206Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2186586Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2186983Z x = x_sign * x_clamp 2025-05-07T20:33:19.2187293Z x0 = x[:, :D] 2025-05-07T20:33:19.2187641Z x1 = x[:, D:] 2025-05-07T20:33:19.2187907Z 2025-05-07T20:33:19.2188154Z if contiguous: 2025-05-07T20:33:19.2188459Z x0 = x0.contiguous() 2025-05-07T20:33:19.2188794Z x1 = x1.contiguous() 2025-05-07T20:33:19.2189117Z 2025-05-07T20:33:19.2189382Z if scale_ub is not None: 2025-05-07T20:33:19.2189759Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2190212Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2190635Z ) 2025-05-07T20:33:19.2190901Z else: 2025-05-07T20:33:19.2191184Z scale_ub_tensor = None 2025-05-07T20:33:19.2191524Z 2025-05-07T20:33:19.2191830Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2192255Z op = silu_mul_quant 2025-05-07T20:33:19.2192594Z if compiled: 2025-05-07T20:33:19.2192923Z op = torch.compile(op) 2025-05-07T20:33:19.2193277Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2193544Z 2025-05-07T20:33:19.2193730Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2193892Z 2025-05-07T20:33:19.2193990Z moe/activation_test.py:117: 2025-05-07T20:33:19.2194281Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2194603Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2194879Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2195624Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:19.2196357Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2196910Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2197576Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2198227Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2198750Z kernel = self.compile( 2025-05-07T20:33:19.2199293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2199928Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2200062Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2200070Z 2025-05-07T20:33:19.2200275Z self = 2025-05-07T20:33:19.2201110Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2201669Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38fee484a0>} 2025-05-07T20:33:19.2202398Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2202590Z context = 2025-05-07T20:33:19.2202634Z 2025-05-07T20:33:19.2202796Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2203064Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2203174Z module_map=module_map) 2025-05-07T20:33:19.2203334Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2203434Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2203509Z E ^ 2025-05-07T20:33:19.2203860Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2203872Z 2025-05-07T20:33:19.2204285Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2204290Z
[The remaining Hypothesis examples repeat the identical test body and fail with the same CompilationError, so they are condensed below. Every eager example (compiled=False) fails inside fn() when silu_mul_quant (activation.py:80) compiles _fbgemm_silu_mul_quant; every compiled example (compiled=True) passes fn() and then fails in ref_fn(), where triton_quantize_fp8_row (fp8_gemm.py:2370) compiles _kernel_quantize_fp8_row.]
2025-05-07T20:33:19.2204389Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=True): fails at y_fp8_ref, y_scale_ref = ref_fn() in _kernel_quantize_fp8_row
2025-05-07T20:33:19.2227010Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False): fails at y_fp8, y_scale = fn() in _fbgemm_silu_mul_quant
2025-05-07T20:33:19.2239679Z Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False): fails at y_fp8, y_scale = fn() in _fbgemm_silu_mul_quant
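All of these failures share one root cause: Triton's fp8e4nv is the CUDA float8 e4m3 format, which Triton's NVIDIA backend provides only on GPUs of compute capability 8.9 (Ada) or newer; on older devices kernel compilation raises exactly the ValueError shown above, listing only fp8e4b15 and fp8e5. A minimal sketch of a capability guard follows; the helper name supports_fp8e4nv and the skipIf wiring are illustrative assumptions, not part of the FBGEMM test suite:

import unittest

import torch

def supports_fp8e4nv() -> bool:
    # Assumption: CUDA device 0 is the device under test. fp8e4nv (e4m3)
    # requires compute capability >= 8.9 in Triton's NVIDIA backend.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability(0) >= (8, 9)

# Applied to the test above, an unsupported GPU would skip instead of
# failing at Triton compile time:
#
# @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
# def test_silu_mul_quant(self, ...): ...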
2025-05-07T20:33:19.2252718Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True): fails at y_fp8_ref, y_scale_ref = ref_fn() in _kernel_quantize_fp8_row
2025-05-07T20:33:19.2268852Z Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False): fails at y_fp8, y_scale = fn() in _fbgemm_silu_mul_quant
2025-05-07T20:33:19.2281518Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False): fails at y_fp8, y_scale = fn() in _fbgemm_silu_mul_quant
2025-05-07T20:33:19.2293971Z Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True): fails at y_fp8_ref, y_scale_ref = ref_fn() in _kernel_quantize_fp8_row
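For reference, the contract the test checks is y ~= y_fp8.to(torch.float32) * y_scale[:, None], i.e. one scale per row. A pure-torch sketch of that row-wise quantization, for illustration only (the real triton_quantize_fp8_row is a Triton kernel; torch.float8_e4m3fn as the target dtype and the scale_ub clamp are assumptions here):

from typing import Optional, Tuple

import torch

def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row max-abs scaling into float8 e4m3 (assumed target format).
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    row_max = y.abs().amax(dim=-1).clamp(min=1e-12)  # avoid divide-by-zero
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    scale = row_max / fp8_max
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    # Dequantization contract checked by the test:
    # y ~= y_fp8.to(torch.float32) * scale[:, None]
    return y_fp8, scale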
2025-05-07T20:33:19.2309837Z Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True): fails at y_fp8_ref, y_scale_ref = ref_fn() in _kernel_quantize_fp8_row
2025-05-07T20:33:19.2325506Z Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True): fails at y_fp8_ref, y_scale_ref = ref_fn() in _kernel_quantize_fp8_row
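The failure is independent of Hypothesis and of silu_mul_quant itself: pushing any CUDA tensor through the same quantization entry point reproduces it. A minimal repro sketch, assuming the same fbgemm_gpu experimental build as in the tracebacks above (the tensor shape is arbitrary):

import torch

from fbgemm_gpu.experimental.gemm.triton_gemm.fp8_gemm import (
    triton_quantize_fp8_row,
)

y = torch.randn(4, 5120, device="cuda", dtype=torch.float32)
# On a GPU without fp8e4nv support, compiling _kernel_quantize_fp8_row
# raises triton.compiler.errors.CompilationError as in the log above.
y_fp8, y_scale = triton_quantize_fp8_row(y, None)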
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2351860Z 2025-05-07T20:33:19.2352274Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Every remaining Hypothesis example fails with this same error, ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"), surfaced from triton/compiler/compiler.py:100 as a CompilationError. The test body and traceback repeat verbatim for each example, so only the sampled parameters and the failing entry point are listed:

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True): fails in ref_fn at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row via triton_quantize_fp8_row (fp8_gemm.py:2370)
Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True): fails in ref_fn at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row
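The root cause is identical across examples: Triton's fp8e4nv type (the NVIDIA float8_e4m3fn variant) only compiles for GPUs with compute capability 8.9 or newer; on older parts (SM < 8.9, e.g. A100 or A10G) the compiler offers only fp8e4b15 and fp8e5, which is exactly the ValueError above. A minimal sketch of a capability guard such a test could use to skip cleanly on unsupported hardware; the helper name and decorator placement are illustrative, not existing FBGEMM API:

import torch

def supports_fp8e4nv() -> bool:
    # fp8e4nv (torch.float8_e4m3fn) needs an NVIDIA GPU with
    # compute capability >= 8.9 (Ada / Hopper or newer).
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9)

# Hypothetical usage on the test above (requires `import unittest`):
# @unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv needs SM >= 8.9")
# def test_silu_mul_quant(...): ...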
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True): fails in fn at moe/activation_test.py:117, compiling _fbgemm_silu_mul_quant via torch.compile and silu_mul_quant (gen_ai/moe/activation.py:80)
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=False, compiled=True): fails in ref_fn at moe/activation_test.py:126, compiling _kernel_quantize_fp8_row
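Two distinct entry points reach the same compile failure: the fused path dies inside _fbgemm_silu_mul_quant (silu_mul_quant, activation.py:80, surfacing through torch._dynamo when compiled=True), while the eager reference dies inside _kernel_quantize_fp8_row (ref_fn -> triton_quantize_fp8_row, fp8_gemm.py:2370). The math under test is small: SiLU(x0) * x1, then rowwise fp8 quantization. A plain-PyTorch sketch of that reference computation, assuming E4M3_MAX = 448.0 (the finite max of torch.float8_e4m3fn) and approximating triton_quantize_fp8_row's exact clamping/eps details:

from typing import Optional, Tuple

import torch

E4M3_MAX = 448.0  # finite max of torch.float8_e4m3fn (fp8e4nv)

def silu_mul_quant_ref(
    x0: torch.Tensor,
    x1: torch.Tensor,
    scale_ub: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, torch.Tensor]:
    # SiLU(x0) * x1 in fp32, exactly as ref_fn does above.
    y = x0.float() * torch.sigmoid(x0.float()) * x1.float()
    # Rowwise max-abs scale, optionally clamped to scale_ub.
    row_max = y.abs().amax(dim=-1)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    scale = torch.clamp(row_max, min=1e-12) / E4M3_MAX
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    # Dequantize with y_fp8.float() * scale[:, None], as the test does.
    return y_fp8, scale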
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2411648Z 2025-05-07T20:33:19.2412058Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2412063Z 2025-05-07T20:33:19.2412161Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2412383Z self=, 2025-05-07T20:33:19.2412459Z T=1, 2025-05-07T20:33:19.2412538Z D=5120, 2025-05-07T20:33:19.2412618Z scale_ub=None, 2025-05-07T20:33:19.2412698Z contiguous=True, 2025-05-07T20:33:19.2412780Z compiled=False, 2025-05-07T20:33:19.2412851Z ) 2025-05-07T20:33:19.2413066Z self = 2025-05-07T20:33:19.2413274Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.2413279Z 2025-05-07T20:33:19.2413354Z @given( 2025-05-07T20:33:19.2413471Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2413575Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2413686Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2413804Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2413913Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2413986Z ) 2025-05-07T20:33:19.2414228Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2414318Z def test_silu_mul_quant( 2025-05-07T20:33:19.2414396Z self, 2025-05-07T20:33:19.2414474Z T: int, 2025-05-07T20:33:19.2414549Z D: int, 2025-05-07T20:33:19.2414642Z scale_ub: Optional[float], 2025-05-07T20:33:19.2414734Z contiguous: bool, 2025-05-07T20:33:19.2414817Z compiled: bool, 2025-05-07T20:33:19.2414892Z ) -> None: 2025-05-07T20:33:19.2414987Z torch.manual_seed(2025) 2025-05-07T20:33:19.2415061Z 2025-05-07T20:33:19.2415232Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2415305Z 2025-05-07T20:33:19.2415393Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2415518Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2415605Z x = x_sign * x_clamp 2025-05-07T20:33:19.2415680Z x0 = x[:, :D] 2025-05-07T20:33:19.2415761Z x1 = x[:, D:] 2025-05-07T20:33:19.2415828Z 2025-05-07T20:33:19.2415908Z if contiguous: 2025-05-07T20:33:19.2416016Z x0 = x0.contiguous() 2025-05-07T20:33:19.2416110Z x1 = x1.contiguous() 2025-05-07T20:33:19.2416192Z 2025-05-07T20:33:19.2416294Z if scale_ub is not None: 2025-05-07T20:33:19.2416396Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2416539Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2416613Z ) 2025-05-07T20:33:19.2416688Z else: 2025-05-07T20:33:19.2416832Z scale_ub_tensor = None 2025-05-07T20:33:19.2416904Z 2025-05-07T20:33:19.2417030Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2417123Z op = silu_mul_quant 2025-05-07T20:33:19.2417204Z if compiled: 2025-05-07T20:33:19.2417299Z op = torch.compile(op) 2025-05-07T20:33:19.2417407Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2417479Z 2025-05-07T20:33:19.2417564Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2417569Z 2025-05-07T20:33:19.2417667Z moe/activation_test.py:117: 2025-05-07T20:33:19.2417793Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2417896Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2417993Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2418486Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.2418587Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2419027Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2419246Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2419589Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2419681Z kernel = self.compile( 2025-05-07T20:33:19.2420078Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2420248Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2420370Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2420414Z 2025-05-07T20:33:19.2420621Z self = 2025-05-07T20:33:19.2421389Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2421885Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d9226660>} 2025-05-07T20:33:19.2422617Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2422808Z context = 2025-05-07T20:33:19.2422812Z 2025-05-07T20:33:19.2422974Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2423233Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2423348Z module_map=module_map) 2025-05-07T20:33:19.2423505Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2423599Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2423681Z E ^ 2025-05-07T20:33:19.2424029Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2424034Z 2025-05-07T20:33:19.2424469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2424474Z 2025-05-07T20:33:19.2424572Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2424787Z self=, 2025-05-07T20:33:19.2424871Z T=128, 2025-05-07T20:33:19.2424944Z D=5120, 2025-05-07T20:33:19.2425021Z scale_ub=None, 2025-05-07T20:33:19.2425109Z contiguous=False, 2025-05-07T20:33:19.2425234Z compiled=True, 2025-05-07T20:33:19.2425306Z ) 2025-05-07T20:33:19.2425524Z self = 2025-05-07T20:33:19.2425688Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:19.2425693Z 2025-05-07T20:33:19.2425771Z @given( 2025-05-07T20:33:19.2425881Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2425976Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2426093Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2426206Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2426315Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2426392Z ) 2025-05-07T20:33:19.2426629Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2426726Z def test_silu_mul_quant( 2025-05-07T20:33:19.2426801Z self, 2025-05-07T20:33:19.2426876Z T: int, 2025-05-07T20:33:19.2426955Z D: int, 2025-05-07T20:33:19.2427133Z scale_ub: Optional[float], 2025-05-07T20:33:19.2427221Z contiguous: bool, 2025-05-07T20:33:19.2427306Z compiled: bool, 2025-05-07T20:33:19.2427380Z ) -> None: 2025-05-07T20:33:19.2427523Z torch.manual_seed(2025) 2025-05-07T20:33:19.2427597Z 2025-05-07T20:33:19.2427762Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2427837Z 2025-05-07T20:33:19.2427931Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2428049Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2428136Z x = x_sign * x_clamp 2025-05-07T20:33:19.2428213Z x0 = x[:, :D] 2025-05-07T20:33:19.2428290Z x1 = x[:, D:] 2025-05-07T20:33:19.2428442Z 2025-05-07T20:33:19.2428520Z if contiguous: 2025-05-07T20:33:19.2428607Z x0 = x0.contiguous() 2025-05-07T20:33:19.2428694Z x1 = x1.contiguous() 2025-05-07T20:33:19.2428766Z 2025-05-07T20:33:19.2428855Z if scale_ub is not None: 2025-05-07T20:33:19.2428965Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2429094Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2429167Z ) 2025-05-07T20:33:19.2429241Z else: 2025-05-07T20:33:19.2429332Z scale_ub_tensor = None 2025-05-07T20:33:19.2429401Z 2025-05-07T20:33:19.2429529Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2429616Z op = silu_mul_quant 2025-05-07T20:33:19.2429703Z if compiled: 2025-05-07T20:33:19.2429801Z op = torch.compile(op) 2025-05-07T20:33:19.2429903Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2429979Z 2025-05-07T20:33:19.2430070Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2430074Z 2025-05-07T20:33:19.2430169Z moe/activation_test.py:117: 2025-05-07T20:33:19.2430305Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2430405Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2430500Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2430869Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.2430959Z return fn(*args, **kwargs) 
2025-05-07T20:33:19.2431452Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.2431546Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2431901Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2432126Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2432466Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2432606Z kernel = self.compile( 2025-05-07T20:33:19.2433011Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2433184Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2433312Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2433317Z 2025-05-07T20:33:19.2433515Z self = 2025-05-07T20:33:19.2434274Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2434769Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d892bb00>} 2025-05-07T20:33:19.2435550Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2435779Z context = 2025-05-07T20:33:19.2435784Z 2025-05-07T20:33:19.2435942Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2436205Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2436309Z module_map=module_map) 2025-05-07T20:33:19.2436466Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2436571Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2436652Z E ^ 2025-05-07T20:33:19.2437041Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2437046Z 2025-05-07T20:33:19.2437471Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2437477Z 2025-05-07T20:33:19.2437579Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2437804Z self=, 2025-05-07T20:33:19.2437881Z T=128, 2025-05-07T20:33:19.2437957Z D=7168, 2025-05-07T20:33:19.2438042Z scale_ub=1200.0, 2025-05-07T20:33:19.2438125Z contiguous=False, 2025-05-07T20:33:19.2438208Z compiled=False, 2025-05-07T20:33:19.2438284Z ) 2025-05-07T20:33:19.2438496Z self = 2025-05-07T20:33:19.2438671Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:19.2438678Z 2025-05-07T20:33:19.2438755Z @given( 2025-05-07T20:33:19.2438870Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2438978Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2439095Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2439209Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2439321Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2439393Z ) 2025-05-07T20:33:19.2439629Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2439726Z def test_silu_mul_quant( 2025-05-07T20:33:19.2439801Z self, 2025-05-07T20:33:19.2439880Z T: int, 2025-05-07T20:33:19.2439958Z D: int, 2025-05-07T20:33:19.2440053Z scale_ub: Optional[float], 2025-05-07T20:33:19.2440357Z contiguous: bool, 2025-05-07T20:33:19.2440479Z compiled: bool, 2025-05-07T20:33:19.2440582Z ) -> None: 2025-05-07T20:33:19.2440685Z torch.manual_seed(2025) 2025-05-07T20:33:19.2440755Z 2025-05-07T20:33:19.2440923Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2441090Z 2025-05-07T20:33:19.2441185Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2441311Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2441404Z x = x_sign * x_clamp 2025-05-07T20:33:19.2441483Z x0 = x[:, :D] 2025-05-07T20:33:19.2441567Z x1 = x[:, D:] 2025-05-07T20:33:19.2441640Z 2025-05-07T20:33:19.2441720Z if contiguous: 2025-05-07T20:33:19.2441819Z x0 = x0.contiguous() 2025-05-07T20:33:19.2441908Z x1 = x1.contiguous() 2025-05-07T20:33:19.2441976Z 2025-05-07T20:33:19.2442069Z if scale_ub is not None: 2025-05-07T20:33:19.2442173Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2442304Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2442381Z ) 2025-05-07T20:33:19.2442459Z else: 2025-05-07T20:33:19.2442550Z scale_ub_tensor = None 2025-05-07T20:33:19.2442627Z 2025-05-07T20:33:19.2442757Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2442964Z op = silu_mul_quant 2025-05-07T20:33:19.2443051Z if compiled: 2025-05-07T20:33:19.2443147Z op = torch.compile(op) 2025-05-07T20:33:19.2443253Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2443323Z 2025-05-07T20:33:19.2443413Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2443417Z 2025-05-07T20:33:19.2443515Z moe/activation_test.py:117: 2025-05-07T20:33:19.2443639Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2443738Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2443839Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2444326Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.2444488Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2444845Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2445067Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2445408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2445503Z kernel = self.compile( 2025-05-07T20:33:19.2445905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2446084Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2446207Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2446212Z 2025-05-07T20:33:19.2446420Z self = 2025-05-07T20:33:19.2447192Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2447685Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d8d66200>} 2025-05-07T20:33:19.2448421Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2448607Z context = 2025-05-07T20:33:19.2448611Z 2025-05-07T20:33:19.2448773Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2449029Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2449139Z module_map=module_map) 2025-05-07T20:33:19.2449350Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2449456Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2449541Z E ^ 2025-05-07T20:33:19.2449890Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2449894Z 2025-05-07T20:33:19.2450308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2450312Z 2025-05-07T20:33:19.2450419Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2450634Z self=, 2025-05-07T20:33:19.2450716Z T=128, 2025-05-07T20:33:19.2450792Z D=5120, 2025-05-07T20:33:19.2450873Z scale_ub=None, 2025-05-07T20:33:19.2450962Z contiguous=False, 2025-05-07T20:33:19.2451045Z compiled=False, 2025-05-07T20:33:19.2451116Z ) 2025-05-07T20:33:19.2451340Z self = 2025-05-07T20:33:19.2451587Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:19.2451592Z 2025-05-07T20:33:19.2451670Z @given( 2025-05-07T20:33:19.2451792Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2451889Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2452008Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2452121Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2452229Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2452304Z ) 2025-05-07T20:33:19.2452542Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2452631Z def test_silu_mul_quant( 2025-05-07T20:33:19.2452750Z self, 2025-05-07T20:33:19.2452826Z T: int, 2025-05-07T20:33:19.2452902Z D: int, 2025-05-07T20:33:19.2453001Z scale_ub: Optional[float], 2025-05-07T20:33:19.2453090Z contiguous: bool, 2025-05-07T20:33:19.2453174Z compiled: bool, 2025-05-07T20:33:19.2453256Z ) -> None: 2025-05-07T20:33:19.2453345Z torch.manual_seed(2025) 2025-05-07T20:33:19.2453420Z 2025-05-07T20:33:19.2453582Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2453654Z 2025-05-07T20:33:19.2453748Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2453869Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2453954Z x = x_sign * x_clamp 2025-05-07T20:33:19.2454034Z x0 = x[:, :D] 2025-05-07T20:33:19.2454111Z x1 = x[:, D:] 2025-05-07T20:33:19.2454183Z 2025-05-07T20:33:19.2454267Z if contiguous: 2025-05-07T20:33:19.2454359Z x0 = x0.contiguous() 2025-05-07T20:33:19.2454450Z x1 = x1.contiguous() 2025-05-07T20:33:19.2454525Z 2025-05-07T20:33:19.2454613Z if scale_ub is not None: 2025-05-07T20:33:19.2454718Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2454860Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2454932Z ) 2025-05-07T20:33:19.2455014Z else: 2025-05-07T20:33:19.2455105Z scale_ub_tensor = None 2025-05-07T20:33:19.2455178Z 2025-05-07T20:33:19.2455312Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2455400Z op = silu_mul_quant 2025-05-07T20:33:19.2455482Z if compiled: 2025-05-07T20:33:19.2455591Z op = torch.compile(op) 2025-05-07T20:33:19.2455693Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2455765Z 2025-05-07T20:33:19.2455860Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2455864Z 2025-05-07T20:33:19.2455958Z moe/activation_test.py:117: 2025-05-07T20:33:19.2456091Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2456186Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2456330Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2456830Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.2456922Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2457275Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2457498Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2457832Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2457929Z kernel = self.compile( 2025-05-07T20:33:19.2458308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2458480Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2458611Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2458657Z 2025-05-07T20:33:19.2458918Z self = 2025-05-07T20:33:19.2459686Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2460175Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d8935940>} 2025-05-07T20:33:19.2460905Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2461136Z context = 2025-05-07T20:33:19.2461144Z 2025-05-07T20:33:19.2461309Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2461574Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2461678Z module_map=module_map) 2025-05-07T20:33:19.2461836Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2461936Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2462012Z E ^ 2025-05-07T20:33:19.2462359Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2462367Z 2025-05-07T20:33:19.2462775Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2462783Z 2025-05-07T20:33:19.2462884Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2463110Z self=, 2025-05-07T20:33:19.2466787Z T=128, 2025-05-07T20:33:19.2466886Z D=5120, 2025-05-07T20:33:19.2466976Z scale_ub=1200.0, 2025-05-07T20:33:19.2467062Z contiguous=True, 2025-05-07T20:33:19.2467144Z compiled=False, 2025-05-07T20:33:19.2467220Z ) 2025-05-07T20:33:19.2467507Z self = 2025-05-07T20:33:19.2467689Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.2467694Z 2025-05-07T20:33:19.2467775Z @given( 2025-05-07T20:33:19.2467895Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2467994Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2468107Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2468221Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2468338Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2468413Z ) 2025-05-07T20:33:19.2468725Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2468824Z def test_silu_mul_quant( 2025-05-07T20:33:19.2468902Z self, 2025-05-07T20:33:19.2468985Z T: int, 2025-05-07T20:33:19.2469062Z D: int, 2025-05-07T20:33:19.2469159Z scale_ub: Optional[float], 2025-05-07T20:33:19.2469251Z contiguous: bool, 2025-05-07T20:33:19.2469336Z compiled: bool, 2025-05-07T20:33:19.2469417Z ) -> None: 2025-05-07T20:33:19.2469513Z torch.manual_seed(2025) 2025-05-07T20:33:19.2469587Z 2025-05-07T20:33:19.2469753Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2469835Z 2025-05-07T20:33:19.2469926Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2470050Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2470140Z x = x_sign * x_clamp 2025-05-07T20:33:19.2470218Z x0 = x[:, :D] 2025-05-07T20:33:19.2470299Z x1 = x[:, D:] 2025-05-07T20:33:19.2470371Z 2025-05-07T20:33:19.2470497Z if contiguous: 2025-05-07T20:33:19.2470627Z x0 = x0.contiguous() 2025-05-07T20:33:19.2470713Z x1 = x1.contiguous() 2025-05-07T20:33:19.2470784Z 2025-05-07T20:33:19.2470871Z if scale_ub is not None: 2025-05-07T20:33:19.2470972Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2471103Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2471181Z ) 2025-05-07T20:33:19.2471257Z else: 2025-05-07T20:33:19.2471348Z scale_ub_tensor = None 2025-05-07T20:33:19.2471422Z 2025-05-07T20:33:19.2471549Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2471646Z op = silu_mul_quant 2025-05-07T20:33:19.2471729Z if compiled: 2025-05-07T20:33:19.2471874Z op = torch.compile(op) 2025-05-07T20:33:19.2471981Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2472054Z 2025-05-07T20:33:19.2472145Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2472153Z 2025-05-07T20:33:19.2472256Z moe/activation_test.py:117: 2025-05-07T20:33:19.2472381Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2472478Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2472577Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2473070Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.2473166Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2473519Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2473733Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2474076Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2474169Z kernel = self.compile( 2025-05-07T20:33:19.2474555Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2474727Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2474848Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2474852Z 2025-05-07T20:33:19.2475054Z self = 2025-05-07T20:33:19.2475815Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2476362Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d872cc20>} 2025-05-07T20:33:19.2477147Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2477337Z context = 2025-05-07T20:33:19.2477342Z 2025-05-07T20:33:19.2477506Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2477762Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2477869Z module_map=module_map) 2025-05-07T20:33:19.2478026Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2478120Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2478205Z E ^ 2025-05-07T20:33:19.2478552Z E ValueError("type fp8e4nv not supported in this architecture. 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=<...>,
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = <...>
T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
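Every one of these failures is the same Triton error: the kernel requests the fp8e4nv (FP8 E4M3) element type, which Triton only compiles for NVIDIA GPUs of compute capability 8.9 and newer; on older architectures only 'fp8e4b15' and 'fp8e5' are available, exactly as the ValueError reports. A minimal sketch of a capability guard that would let such tests skip rather than error on older GPUs; supports_fp8e4nv and skip_if_no_fp8 are illustrative names, not an existing FBGEMM API:

import unittest

import torch


def supports_fp8e4nv() -> bool:
    """Best-effort check: Triton lowers fp8e4nv only on SM 8.9+ NVIDIA GPUs."""
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return (major, minor) >= (8, 9)


# Applied to a test class or method, this turns the CompilationError into a skip.
skip_if_no_fp8 = unittest.skipUnless(
    supports_fp8e4nv(), "fp8e4nv (FP8 E4M3) requires compute capability >= 8.9"
)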
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Every remaining example fails with the same fp8e4nv CompilationError, in the same test body and traceback as above; only the sampled parameters (and, in one case, the failing call) differ.

Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
>       y_fp8, y_scale = fn()
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
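The "Trying example" lines are Hypothesis verbose-mode output: each argument is drawn independently by st.sampled_from, which is why the log cycles through assorted (T, D, scale_ub, contiguous, compiled) tuples, repeats included. A self-contained sketch of that sampling pattern, with a stand-in assertion instead of the real test body (assumes hypothesis is installed):

from hypothesis import Verbosity, given, settings, strategies as st


@settings(verbosity=Verbosity.verbose, max_examples=5, deadline=None)
@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
    scale_ub=st.sampled_from([None, 1200.00]),
)
def check_sampling(T, D, scale_ub) -> None:
    # Each argument is an independent draw from its list, so duplicate and
    # near-duplicate examples across a run are expected.
    assert T in (1, 128, 2048, 4096, 16384) and D in (5120, 7168)


check_sampling()  # prints "Trying example: check_sampling(...)" lines like the ones above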
Trying example: test_silu_mul_quant(self=<...>, T=1, D=7168, scale_ub=None, contiguous=False, compiled=True)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
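This example is the one variant in the section: the failing line moved from fn() to the reference path, where triton_quantize_fp8_row compiles its own FP8 kernel (_kernel_quantize_fp8_row) and trips the identical fp8e4nv error. For comparison, a pure-PyTorch sketch of what the reference path computes, assuming triton_quantize_fp8_row performs per-row max-abs scaling into torch.float8_e4m3fn; the function names and exact scaling details are illustrative, not the FBGEMM implementation:

from typing import Optional, Tuple

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3


def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Per-row max-abs scaling; scale_ub, when given, caps the row maximum.
    row_max = y.abs().amax(dim=1).float()
    if scale_ub is not None:
        row_max = torch.clamp(row_max, max=scale_ub.item())
    y_scale = torch.clamp(row_max, min=1e-12) / FP8_MAX  # dequant scale per row
    y_fp8 = (y / y_scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, y_scale


def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
    # SiLU(x0) * x1 in fp32, matching the test's ref_fn.
    x0_fp32 = x0.to(torch.float32)
    return x0_fp32 * torch.sigmoid(x0_fp32) * x1.to(torch.float32)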
Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
>       y_fp8, y_scale = fn()
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
>       y_fp8, y_scale = fn()
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
>       y_fp8, y_scale = fn()
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(self=<...>, T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
>       y_fp8, y_scale = fn()
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(self=<...>, T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
>       y_fp8, y_scale = fn()
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(self=<...>, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
>       y_fp8, y_scale = fn()
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
Trying example: test_silu_mul_quant(self=<...>, T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
>       y_fp8, y_scale = fn()
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2611953Z 2025-05-07T20:33:19.2612389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2612393Z 2025-05-07T20:33:19.2612489Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2612703Z self=, 2025-05-07T20:33:19.2612779Z T=4096, 2025-05-07T20:33:19.2612849Z D=5120, 2025-05-07T20:33:19.2612925Z scale_ub=None, 2025-05-07T20:33:19.2613011Z contiguous=False, 2025-05-07T20:33:19.2613086Z compiled=True, 2025-05-07T20:33:19.2613156Z ) 2025-05-07T20:33:19.2613365Z self = 2025-05-07T20:33:19.2613529Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:19.2613537Z 2025-05-07T20:33:19.2613608Z @given( 2025-05-07T20:33:19.2613767Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2613863Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2613983Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2614095Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2614202Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2614274Z ) 2025-05-07T20:33:19.2614508Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2614598Z def test_silu_mul_quant( 2025-05-07T20:33:19.2614669Z self, 2025-05-07T20:33:19.2614741Z T: int, 2025-05-07T20:33:19.2614815Z D: int, 2025-05-07T20:33:19.2614906Z scale_ub: Optional[float], 2025-05-07T20:33:19.2614991Z contiguous: bool, 2025-05-07T20:33:19.2615073Z compiled: bool, 2025-05-07T20:33:19.2615147Z ) -> None: 2025-05-07T20:33:19.2615237Z torch.manual_seed(2025) 2025-05-07T20:33:19.2615310Z 2025-05-07T20:33:19.2615472Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2615610Z 2025-05-07T20:33:19.2615736Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2615866Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2615968Z x = x_sign * x_clamp 2025-05-07T20:33:19.2616054Z x0 = x[:, :D] 2025-05-07T20:33:19.2616137Z x1 = x[:, D:] 2025-05-07T20:33:19.2616208Z 2025-05-07T20:33:19.2616288Z if contiguous: 2025-05-07T20:33:19.2616373Z x0 = x0.contiguous() 2025-05-07T20:33:19.2616460Z x1 = x1.contiguous() 2025-05-07T20:33:19.2616524Z 2025-05-07T20:33:19.2616608Z if scale_ub is not None: 2025-05-07T20:33:19.2616711Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2616838Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2616955Z ) 2025-05-07T20:33:19.2617029Z else: 2025-05-07T20:33:19.2617116Z scale_ub_tensor = None 2025-05-07T20:33:19.2617183Z 2025-05-07T20:33:19.2617312Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2617396Z op = silu_mul_quant 2025-05-07T20:33:19.2617478Z if compiled: 2025-05-07T20:33:19.2617570Z op = torch.compile(op) 2025-05-07T20:33:19.2617669Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2617738Z 2025-05-07T20:33:19.2617824Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2617829Z 2025-05-07T20:33:19.2617920Z moe/activation_test.py:117: 2025-05-07T20:33:19.2618045Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2618138Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2618231Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2618596Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.2618686Z return fn(*args, **kwargs) 
2025-05-07T20:33:19.2619176Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.2619268Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2619618Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2619836Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2620170Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2620262Z kernel = self.compile( 2025-05-07T20:33:19.2620638Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2620806Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2620931Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2620936Z 2025-05-07T20:33:19.2621177Z self = 2025-05-07T20:33:19.2621943Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2622433Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d9a74c20>} 2025-05-07T20:33:19.2623165Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2623354Z context = 2025-05-07T20:33:19.2623359Z 2025-05-07T20:33:19.2623518Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2623815Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2623954Z module_map=module_map) 2025-05-07T20:33:19.2624108Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2624205Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2624279Z E ^ 2025-05-07T20:33:19.2624623Z E ValueError("type fp8e4nv not supported in this architecture. 
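The failure is an architecture mismatch rather than a data-dependent bug: this job runs on linux.g5.4xlarge (NVIDIA A10G, compute capability 8.6), while Triton only lowers the fp8e4nv dtype (PyTorch's torch.float8_e4m3fn) on compute capability 8.9 and newer; on SM 8.6 only fp8e4b15 and fp8e5 exist, exactly as the ValueError reports. Every example Hypothesis goes on to try (listed below) therefore dies in the same compile step. A minimal sketch of a capability guard that would skip these cases on unsupported GPUs (the helper name _supports_fp8e4nv and the decorator placement are illustrative assumptions, not code from moe/activation_test.py):

    # Illustrative sketch (assumed helper name; not part of the FBGEMM tree):
    # skip Triton fp8e4nv tests on GPUs that cannot lower the dtype.
    import torch

    def _supports_fp8e4nv() -> bool:
        """True only if the current CUDA device can compile fp8e4nv kernels."""
        if not torch.cuda.is_available():
            return False
        # fp8e4nv (torch.float8_e4m3fn) needs SM 8.9 (Ada) or SM 9.0 (Hopper);
        # the A10G on this linux.g5.4xlarge runner reports (8, 6).
        return torch.cuda.get_device_capability() >= (8, 9)

    # Hypothetical usage on the failing test (requires `import unittest`):
    #
    # @unittest.skipUnless(_supports_fp8e4nv(), "fp8e4nv requires SM 8.9+")
    # def test_silu_mul_quant(self, ...) -> None:
    #     ...

With such a guard the job would record a skip on A10G runners instead of replaying the identical CompilationError for every drawn example.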
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2624632Z 2025-05-07T20:33:19.2625039Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2625043Z 2025-05-07T20:33:19.2625139Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2625394Z self=, 2025-05-07T20:33:19.2625465Z T=4096, 2025-05-07T20:33:19.2625539Z D=5120, 2025-05-07T20:33:19.2625619Z scale_ub=1200.0, 2025-05-07T20:33:19.2625707Z contiguous=False, 2025-05-07T20:33:19.2625787Z compiled=False, 2025-05-07T20:33:19.2625856Z ) 2025-05-07T20:33:19.2626088Z self = 2025-05-07T20:33:19.2626289Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:19.2626294Z 2025-05-07T20:33:19.2626369Z @given( 2025-05-07T20:33:19.2626479Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2626573Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2626683Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2626794Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2626903Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2626973Z ) 2025-05-07T20:33:19.2627209Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2627301Z def test_silu_mul_quant( 2025-05-07T20:33:19.2627377Z self, 2025-05-07T20:33:19.2627503Z T: int, 2025-05-07T20:33:19.2627575Z D: int, 2025-05-07T20:33:19.2627667Z scale_ub: Optional[float], 2025-05-07T20:33:19.2627754Z contiguous: bool, 2025-05-07T20:33:19.2627833Z compiled: bool, 2025-05-07T20:33:19.2627905Z ) -> None: 2025-05-07T20:33:19.2627995Z torch.manual_seed(2025) 2025-05-07T20:33:19.2628064Z 2025-05-07T20:33:19.2628224Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2628294Z 2025-05-07T20:33:19.2628379Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2628497Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2628583Z x = x_sign * x_clamp 2025-05-07T20:33:19.2628659Z x0 = x[:, :D] 2025-05-07T20:33:19.2628734Z x1 = x[:, D:] 2025-05-07T20:33:19.2628800Z 2025-05-07T20:33:19.2628877Z if contiguous: 2025-05-07T20:33:19.2629010Z x0 = x0.contiguous() 2025-05-07T20:33:19.2629098Z x1 = x1.contiguous() 2025-05-07T20:33:19.2629165Z 2025-05-07T20:33:19.2629250Z if scale_ub is not None: 2025-05-07T20:33:19.2629349Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2629478Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2629548Z ) 2025-05-07T20:33:19.2629618Z else: 2025-05-07T20:33:19.2629706Z scale_ub_tensor = None 2025-05-07T20:33:19.2629776Z 2025-05-07T20:33:19.2629899Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2629982Z op = silu_mul_quant 2025-05-07T20:33:19.2630064Z if compiled: 2025-05-07T20:33:19.2630156Z op = torch.compile(op) 2025-05-07T20:33:19.2630262Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2630330Z 2025-05-07T20:33:19.2630415Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2630424Z 2025-05-07T20:33:19.2630516Z moe/activation_test.py:117: 2025-05-07T20:33:19.2630718Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2630812Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2630906Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2631391Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:19.2631483Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2631835Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2632050Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2632388Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2632517Z kernel = self.compile( 2025-05-07T20:33:19.2632899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2633072Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2633192Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2633197Z 2025-05-07T20:33:19.2633393Z self = 2025-05-07T20:33:19.2634151Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2634637Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d9a756c0>} 2025-05-07T20:33:19.2635375Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2635559Z context = 2025-05-07T20:33:19.2635564Z 2025-05-07T20:33:19.2635726Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2635978Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2636083Z module_map=module_map) 2025-05-07T20:33:19.2636235Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2636327Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2636398Z E ^ 2025-05-07T20:33:19.2636743Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2636751Z 2025-05-07T20:33:19.2637201Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2637211Z 2025-05-07T20:33:19.2637311Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2637525Z self=, 2025-05-07T20:33:19.2637597Z T=4096, 2025-05-07T20:33:19.2637667Z D=5120, 2025-05-07T20:33:19.2637747Z scale_ub=1200.0, 2025-05-07T20:33:19.2637831Z contiguous=False, 2025-05-07T20:33:19.2637907Z compiled=True, 2025-05-07T20:33:19.2637972Z ) 2025-05-07T20:33:19.2638185Z self = 2025-05-07T20:33:19.2638353Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:19.2638357Z 2025-05-07T20:33:19.2638429Z @given( 2025-05-07T20:33:19.2638545Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2638637Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2638749Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2638938Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2639047Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2639117Z ) 2025-05-07T20:33:19.2639364Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2639450Z def test_silu_mul_quant( 2025-05-07T20:33:19.2639523Z self, 2025-05-07T20:33:19.2639596Z T: int, 2025-05-07T20:33:19.2639667Z D: int, 2025-05-07T20:33:19.2639761Z scale_ub: Optional[float], 2025-05-07T20:33:19.2639842Z contiguous: bool, 2025-05-07T20:33:19.2639921Z compiled: bool, 2025-05-07T20:33:19.2639997Z ) -> None: 2025-05-07T20:33:19.2640296Z torch.manual_seed(2025) 2025-05-07T20:33:19.2640519Z 2025-05-07T20:33:19.2640698Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2640765Z 2025-05-07T20:33:19.2640859Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2640980Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2641066Z x = x_sign * x_clamp 2025-05-07T20:33:19.2641142Z x0 = x[:, :D] 2025-05-07T20:33:19.2641215Z x1 = x[:, D:] 2025-05-07T20:33:19.2641284Z 2025-05-07T20:33:19.2641363Z if contiguous: 2025-05-07T20:33:19.2641449Z x0 = x0.contiguous() 2025-05-07T20:33:19.2641530Z x1 = x1.contiguous() 2025-05-07T20:33:19.2641601Z 2025-05-07T20:33:19.2641685Z if scale_ub is not None: 2025-05-07T20:33:19.2641789Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2641916Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2641991Z ) 2025-05-07T20:33:19.2642068Z else: 2025-05-07T20:33:19.2642162Z scale_ub_tensor = None 2025-05-07T20:33:19.2642231Z 2025-05-07T20:33:19.2642358Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2642445Z op = silu_mul_quant 2025-05-07T20:33:19.2642528Z if compiled: 2025-05-07T20:33:19.2642631Z op = torch.compile(op) 2025-05-07T20:33:19.2642736Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2642807Z 2025-05-07T20:33:19.2642897Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2642901Z 2025-05-07T20:33:19.2642991Z moe/activation_test.py:117: 2025-05-07T20:33:19.2643117Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2643210Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2643303Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2643662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.2643750Z return fn(*args, **kwargs) 
2025-05-07T20:33:19.2644335Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.2644436Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2644793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2645015Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2645347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2645438Z kernel = self.compile( 2025-05-07T20:33:19.2645839Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2646006Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2646128Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2646138Z 2025-05-07T20:33:19.2646334Z self = 2025-05-07T20:33:19.2647157Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2647701Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d9a76fc0>} 2025-05-07T20:33:19.2648431Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2648620Z context = 2025-05-07T20:33:19.2648625Z 2025-05-07T20:33:19.2648832Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2649091Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2649204Z module_map=module_map) 2025-05-07T20:33:19.2649360Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2649459Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2649532Z E ^ 2025-05-07T20:33:19.2649876Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2649881Z 2025-05-07T20:33:19.2650314Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2650319Z 2025-05-07T20:33:19.2650417Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2650633Z self=, 2025-05-07T20:33:19.2650717Z T=2048, 2025-05-07T20:33:19.2650791Z D=7168, 2025-05-07T20:33:19.2650876Z scale_ub=1200.0, 2025-05-07T20:33:19.2650960Z contiguous=False, 2025-05-07T20:33:19.2651047Z compiled=False, 2025-05-07T20:33:19.2651125Z ) 2025-05-07T20:33:19.2651341Z self = 2025-05-07T20:33:19.2651512Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:19.2651517Z 2025-05-07T20:33:19.2651596Z @given( 2025-05-07T20:33:19.2651710Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2651806Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2651923Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2652035Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2652148Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2652221Z ) 2025-05-07T20:33:19.2652460Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2652558Z def test_silu_mul_quant( 2025-05-07T20:33:19.2652632Z self, 2025-05-07T20:33:19.2652752Z T: int, 2025-05-07T20:33:19.2652827Z D: int, 2025-05-07T20:33:19.2652925Z scale_ub: Optional[float], 2025-05-07T20:33:19.2653008Z contiguous: bool, 2025-05-07T20:33:19.2653090Z compiled: bool, 2025-05-07T20:33:19.2653167Z ) -> None: 2025-05-07T20:33:19.2653257Z torch.manual_seed(2025) 2025-05-07T20:33:19.2653329Z 2025-05-07T20:33:19.2653491Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2653565Z 2025-05-07T20:33:19.2653652Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2653771Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2653860Z x = x_sign * x_clamp 2025-05-07T20:33:19.2653936Z x0 = x[:, :D] 2025-05-07T20:33:19.2654010Z x1 = x[:, D:] 2025-05-07T20:33:19.2654087Z 2025-05-07T20:33:19.2654166Z if contiguous: 2025-05-07T20:33:19.2654252Z x0 = x0.contiguous() 2025-05-07T20:33:19.2654340Z x1 = x1.contiguous() 2025-05-07T20:33:19.2654413Z 2025-05-07T20:33:19.2654649Z if scale_ub is not None: 2025-05-07T20:33:19.2654756Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2654884Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2654962Z ) 2025-05-07T20:33:19.2655034Z else: 2025-05-07T20:33:19.2655124Z scale_ub_tensor = None 2025-05-07T20:33:19.2655196Z 2025-05-07T20:33:19.2655319Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2655403Z op = silu_mul_quant 2025-05-07T20:33:19.2655491Z if compiled: 2025-05-07T20:33:19.2655587Z op = torch.compile(op) 2025-05-07T20:33:19.2655689Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2655765Z 2025-05-07T20:33:19.2655892Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2655897Z 2025-05-07T20:33:19.2655990Z moe/activation_test.py:117: 2025-05-07T20:33:19.2656120Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2656220Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2656317Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2656801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:19.2656893Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2657252Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2657468Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2657809Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2657903Z kernel = self.compile( 2025-05-07T20:33:19.2658299Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2658476Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2658604Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2658609Z 2025-05-07T20:33:19.2658807Z self = 2025-05-07T20:33:19.2659573Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2660061Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d9a77ec0>} 2025-05-07T20:33:19.2660846Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2661038Z context = 2025-05-07T20:33:19.2661043Z 2025-05-07T20:33:19.2661205Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2661461Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2661565Z module_map=module_map) 2025-05-07T20:33:19.2661731Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2661823Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2661895Z E ^ 2025-05-07T20:33:19.2662246Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2662250Z 2025-05-07T20:33:19.2662662Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2662667Z 2025-05-07T20:33:19.2662773Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2663068Z self=, 2025-05-07T20:33:19.2663144Z T=1, 2025-05-07T20:33:19.2663223Z D=7168, 2025-05-07T20:33:19.2663300Z scale_ub=None, 2025-05-07T20:33:19.2663380Z contiguous=True, 2025-05-07T20:33:19.2663465Z compiled=False, 2025-05-07T20:33:19.2663535Z ) 2025-05-07T20:33:19.2663746Z self = 2025-05-07T20:33:19.2663910Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.2663915Z 2025-05-07T20:33:19.2663991Z @given( 2025-05-07T20:33:19.2664107Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2664203Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2664356Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2664474Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2664585Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2664662Z ) 2025-05-07T20:33:19.2664901Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2664987Z def test_silu_mul_quant( 2025-05-07T20:33:19.2665062Z self, 2025-05-07T20:33:19.2665134Z T: int, 2025-05-07T20:33:19.2665204Z D: int, 2025-05-07T20:33:19.2665303Z scale_ub: Optional[float], 2025-05-07T20:33:19.2665386Z contiguous: bool, 2025-05-07T20:33:19.2665464Z compiled: bool, 2025-05-07T20:33:19.2665539Z ) -> None: 2025-05-07T20:33:19.2665627Z torch.manual_seed(2025) 2025-05-07T20:33:19.2665695Z 2025-05-07T20:33:19.2665865Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2665939Z 2025-05-07T20:33:19.2666038Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2666174Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2666284Z x = x_sign * x_clamp 2025-05-07T20:33:19.2666372Z x0 = x[:, :D] 2025-05-07T20:33:19.2666454Z x1 = x[:, D:] 2025-05-07T20:33:19.2666525Z 2025-05-07T20:33:19.2666610Z if contiguous: 2025-05-07T20:33:19.2666697Z x0 = x0.contiguous() 2025-05-07T20:33:19.2666783Z x1 = x1.contiguous() 2025-05-07T20:33:19.2666856Z 2025-05-07T20:33:19.2666941Z if scale_ub is not None: 2025-05-07T20:33:19.2667041Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2667171Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2667243Z ) 2025-05-07T20:33:19.2667316Z else: 2025-05-07T20:33:19.2667461Z scale_ub_tensor = None 2025-05-07T20:33:19.2667531Z 2025-05-07T20:33:19.2667654Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2667743Z op = silu_mul_quant 2025-05-07T20:33:19.2667823Z if compiled: 2025-05-07T20:33:19.2667964Z op = torch.compile(op) 2025-05-07T20:33:19.2668070Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2668136Z 2025-05-07T20:33:19.2668224Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2668229Z 2025-05-07T20:33:19.2668320Z moe/activation_test.py:117: 2025-05-07T20:33:19.2668441Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2668536Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2668630Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2669120Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.2669212Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2669564Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2669784Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2670164Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2670289Z kernel = self.compile( 2025-05-07T20:33:19.2670692Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2670858Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2670982Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2670987Z 2025-05-07T20:33:19.2671184Z self = 2025-05-07T20:33:19.2671941Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2672501Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d88d4cc0>} 2025-05-07T20:33:19.2673235Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2673425Z context = 2025-05-07T20:33:19.2673430Z 2025-05-07T20:33:19.2673585Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2673837Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2673944Z module_map=module_map) 2025-05-07T20:33:19.2674100Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2674197Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2674272Z E ^ 2025-05-07T20:33:19.2674622Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2674629Z 2025-05-07T20:33:19.2675043Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2675048Z 2025-05-07T20:33:19.2675145Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2675362Z self=, 2025-05-07T20:33:19.2675436Z T=16384, 2025-05-07T20:33:19.2675509Z D=7168, 2025-05-07T20:33:19.2675586Z scale_ub=1200.0, 2025-05-07T20:33:19.2675667Z contiguous=False, 2025-05-07T20:33:19.2675747Z compiled=True, 2025-05-07T20:33:19.2675816Z ) 2025-05-07T20:33:19.2676052Z self = 2025-05-07T20:33:19.2676253Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:19.2676257Z 2025-05-07T20:33:19.2676381Z @given( 2025-05-07T20:33:19.2676498Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2676600Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2676710Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2676820Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2676928Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2677000Z ) 2025-05-07T20:33:19.2677236Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2677327Z def test_silu_mul_quant( 2025-05-07T20:33:19.2677397Z self, 2025-05-07T20:33:19.2677471Z T: int, 2025-05-07T20:33:19.2677547Z D: int, 2025-05-07T20:33:19.2677638Z scale_ub: Optional[float], 2025-05-07T20:33:19.2677728Z contiguous: bool, 2025-05-07T20:33:19.2677813Z compiled: bool, 2025-05-07T20:33:19.2677885Z ) -> None: 2025-05-07T20:33:19.2677979Z torch.manual_seed(2025) 2025-05-07T20:33:19.2678047Z 2025-05-07T20:33:19.2678289Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2678368Z 2025-05-07T20:33:19.2678454Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2678572Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2678657Z x = x_sign * x_clamp 2025-05-07T20:33:19.2678732Z x0 = x[:, :D] 2025-05-07T20:33:19.2678805Z x1 = x[:, D:] 2025-05-07T20:33:19.2678874Z 2025-05-07T20:33:19.2678953Z if contiguous: 2025-05-07T20:33:19.2679038Z x0 = x0.contiguous() 2025-05-07T20:33:19.2679127Z x1 = x1.contiguous() 2025-05-07T20:33:19.2679195Z 2025-05-07T20:33:19.2679281Z if scale_ub is not None: 2025-05-07T20:33:19.2679379Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2679551Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2679627Z ) 2025-05-07T20:33:19.2679703Z else: 2025-05-07T20:33:19.2679792Z scale_ub_tensor = None 2025-05-07T20:33:19.2679870Z 2025-05-07T20:33:19.2679993Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2680077Z op = silu_mul_quant 2025-05-07T20:33:19.2680161Z if compiled: 2025-05-07T20:33:19.2680254Z op = torch.compile(op) 2025-05-07T20:33:19.2680352Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2680423Z 2025-05-07T20:33:19.2680507Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2680512Z 2025-05-07T20:33:19.2680606Z moe/activation_test.py:117: 2025-05-07T20:33:19.2680728Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2680823Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2680920Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2681281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.2681371Z return fn(*args, **kwargs) 
2025-05-07T20:33:19.2681864Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.2681954Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2682307Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2682522Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2682855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2682945Z kernel = self.compile( 2025-05-07T20:33:19.2683338Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2683506Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2683675Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2683683Z 2025-05-07T20:33:19.2683881Z self = 2025-05-07T20:33:19.2684642Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2685127Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d88d60c0>} 2025-05-07T20:33:19.2685858Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2686041Z context = 2025-05-07T20:33:19.2686048Z 2025-05-07T20:33:19.2686245Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2686538Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2686639Z module_map=module_map) 2025-05-07T20:33:19.2686800Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2686892Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2686964Z E ^ 2025-05-07T20:33:19.2687315Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2687319Z 2025-05-07T20:33:19.2687748Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2687792Z 2025-05-07T20:33:19.2687890Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2688111Z self=, 2025-05-07T20:33:19.2688185Z T=1, 2025-05-07T20:33:19.2688264Z D=7168, 2025-05-07T20:33:19.2688341Z scale_ub=None, 2025-05-07T20:33:19.2688420Z contiguous=False, 2025-05-07T20:33:19.2688504Z compiled=False, 2025-05-07T20:33:19.2688570Z ) 2025-05-07T20:33:19.2688782Z self = 2025-05-07T20:33:19.2688947Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:19.2688952Z 2025-05-07T20:33:19.2689026Z @given( 2025-05-07T20:33:19.2689139Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2689237Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2689346Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2689458Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2689568Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2689636Z ) 2025-05-07T20:33:19.2689877Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2689968Z def test_silu_mul_quant( 2025-05-07T20:33:19.2690040Z self, 2025-05-07T20:33:19.2690116Z T: int, 2025-05-07T20:33:19.2690188Z D: int, 2025-05-07T20:33:19.2690279Z scale_ub: Optional[float], 2025-05-07T20:33:19.2690365Z contiguous: bool, 2025-05-07T20:33:19.2690444Z compiled: bool, 2025-05-07T20:33:19.2690516Z ) -> None: 2025-05-07T20:33:19.2690607Z torch.manual_seed(2025) 2025-05-07T20:33:19.2690676Z 2025-05-07T20:33:19.2690840Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2690909Z 2025-05-07T20:33:19.2690994Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2691114Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2691198Z x = x_sign * x_clamp 2025-05-07T20:33:19.2691271Z x0 = x[:, :D] 2025-05-07T20:33:19.2691348Z x1 = x[:, D:] 2025-05-07T20:33:19.2691459Z 2025-05-07T20:33:19.2691539Z if contiguous: 2025-05-07T20:33:19.2691628Z x0 = x0.contiguous() 2025-05-07T20:33:19.2691712Z x1 = x1.contiguous() 2025-05-07T20:33:19.2691781Z 2025-05-07T20:33:19.2691868Z if scale_ub is not None: 2025-05-07T20:33:19.2691967Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2692099Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2692170Z ) 2025-05-07T20:33:19.2692240Z else: 2025-05-07T20:33:19.2692332Z scale_ub_tensor = None 2025-05-07T20:33:19.2692399Z 2025-05-07T20:33:19.2692525Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2692611Z op = silu_mul_quant 2025-05-07T20:33:19.2692692Z if compiled: 2025-05-07T20:33:19.2692787Z op = torch.compile(op) 2025-05-07T20:33:19.2692888Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2692956Z 2025-05-07T20:33:19.2693044Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2693091Z 2025-05-07T20:33:19.2693223Z moe/activation_test.py:117: 2025-05-07T20:33:19.2693348Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2693448Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2693543Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2694035Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.2694134Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2694489Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2694707Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2695086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2695179Z kernel = self.compile( 2025-05-07T20:33:19.2695568Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2695735Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2695856Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2695860Z 2025-05-07T20:33:19.2696057Z self = 2025-05-07T20:33:19.2696816Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2697311Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d88d6c00>} 2025-05-07T20:33:19.2698043Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2698227Z context = 2025-05-07T20:33:19.2698234Z 2025-05-07T20:33:19.2698390Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2698643Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2698746Z module_map=module_map) 2025-05-07T20:33:19.2698903Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2698995Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2699079Z E ^ 2025-05-07T20:33:19.2699426Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2699471Z 2025-05-07T20:33:19.2699911Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2699918Z 2025-05-07T20:33:19.2700015Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2700231Z self=, 2025-05-07T20:33:19.2700307Z T=2048, 2025-05-07T20:33:19.2700379Z D=7168, 2025-05-07T20:33:19.2700454Z scale_ub=None, 2025-05-07T20:33:19.2700543Z contiguous=False, 2025-05-07T20:33:19.2700622Z compiled=True, 2025-05-07T20:33:19.2700693Z ) 2025-05-07T20:33:19.2700907Z self = 2025-05-07T20:33:19.2701073Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:19.2701083Z 2025-05-07T20:33:19.2701159Z @given( 2025-05-07T20:33:19.2704583Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2704699Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2704979Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2705092Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2705200Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2705273Z ) 2025-05-07T20:33:19.2705511Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2705604Z def test_silu_mul_quant( 2025-05-07T20:33:19.2705679Z self, 2025-05-07T20:33:19.2705754Z T: int, 2025-05-07T20:33:19.2705828Z D: int, 2025-05-07T20:33:19.2705921Z scale_ub: Optional[float], 2025-05-07T20:33:19.2706004Z contiguous: bool, 2025-05-07T20:33:19.2706087Z compiled: bool, 2025-05-07T20:33:19.2706162Z ) -> None: 2025-05-07T20:33:19.2706293Z torch.manual_seed(2025) 2025-05-07T20:33:19.2706365Z 2025-05-07T20:33:19.2706529Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2706603Z 2025-05-07T20:33:19.2706694Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2706817Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2706899Z x = x_sign * x_clamp 2025-05-07T20:33:19.2706978Z x0 = x[:, :D] 2025-05-07T20:33:19.2707052Z x1 = x[:, D:] 2025-05-07T20:33:19.2707126Z 2025-05-07T20:33:19.2707205Z if contiguous: 2025-05-07T20:33:19.2707289Z x0 = x0.contiguous() 2025-05-07T20:33:19.2707376Z x1 = x1.contiguous() 2025-05-07T20:33:19.2707504Z 2025-05-07T20:33:19.2707590Z if scale_ub is not None: 2025-05-07T20:33:19.2707695Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2707824Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2707900Z ) 2025-05-07T20:33:19.2707975Z else: 2025-05-07T20:33:19.2708063Z scale_ub_tensor = None 2025-05-07T20:33:19.2708128Z 2025-05-07T20:33:19.2708257Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2708346Z op = silu_mul_quant 2025-05-07T20:33:19.2708430Z if compiled: 2025-05-07T20:33:19.2708525Z op = torch.compile(op) 2025-05-07T20:33:19.2708625Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2708693Z 2025-05-07T20:33:19.2708780Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2708785Z 2025-05-07T20:33:19.2708877Z moe/activation_test.py:117: 2025-05-07T20:33:19.2709003Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2709100Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2709194Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2709567Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.2709661Z return fn(*args, **kwargs) 
2025-05-07T20:33:19.2710202Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.2710302Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2710655Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2710875Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2711209Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2711298Z kernel = self.compile( 2025-05-07T20:33:19.2711698Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2711867Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2711994Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2711999Z 2025-05-07T20:33:19.2712197Z self = 2025-05-07T20:33:19.2713041Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2713532Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d93802c0>} 2025-05-07T20:33:19.2714260Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2714447Z context = 2025-05-07T20:33:19.2714491Z 2025-05-07T20:33:19.2714648Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2714911Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2715016Z module_map=module_map) 2025-05-07T20:33:19.2715170Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2715269Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2715344Z E ^ 2025-05-07T20:33:19.2715691Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2715696Z 2025-05-07T20:33:19.2716144Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2716149Z 2025-05-07T20:33:19.2716266Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2716492Z self=, 2025-05-07T20:33:19.2716567Z T=4096, 2025-05-07T20:33:19.2716641Z D=7168, 2025-05-07T20:33:19.2716725Z scale_ub=None, 2025-05-07T20:33:19.2716810Z contiguous=False, 2025-05-07T20:33:19.2716893Z compiled=True, 2025-05-07T20:33:19.2716965Z ) 2025-05-07T20:33:19.2717174Z self = 2025-05-07T20:33:19.2717342Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:19.2717350Z 2025-05-07T20:33:19.2717421Z @given( 2025-05-07T20:33:19.2717533Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2717628Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2717734Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2717844Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2717955Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2718028Z ) 2025-05-07T20:33:19.2718264Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2718353Z def test_silu_mul_quant( 2025-05-07T20:33:19.2718469Z self, 2025-05-07T20:33:19.2718549Z T: int, 2025-05-07T20:33:19.2718621Z D: int, 2025-05-07T20:33:19.2718713Z scale_ub: Optional[float], 2025-05-07T20:33:19.2718798Z contiguous: bool, 2025-05-07T20:33:19.2718879Z compiled: bool, 2025-05-07T20:33:19.2718950Z ) -> None: 2025-05-07T20:33:19.2719040Z torch.manual_seed(2025) 2025-05-07T20:33:19.2719108Z 2025-05-07T20:33:19.2719268Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2719338Z 2025-05-07T20:33:19.2719427Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2719547Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2719639Z x = x_sign * x_clamp 2025-05-07T20:33:19.2719714Z x0 = x[:, :D] 2025-05-07T20:33:19.2719794Z x1 = x[:, D:] 2025-05-07T20:33:19.2719868Z 2025-05-07T20:33:19.2719948Z if contiguous: 2025-05-07T20:33:19.2720042Z x0 = x0.contiguous() 2025-05-07T20:33:19.2720130Z x1 = x1.contiguous() 2025-05-07T20:33:19.2720275Z 2025-05-07T20:33:19.2720365Z if scale_ub is not None: 2025-05-07T20:33:19.2720464Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2720594Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2720671Z ) 2025-05-07T20:33:19.2720743Z else: 2025-05-07T20:33:19.2720830Z scale_ub_tensor = None 2025-05-07T20:33:19.2720902Z 2025-05-07T20:33:19.2721025Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2721109Z op = silu_mul_quant 2025-05-07T20:33:19.2721194Z if compiled: 2025-05-07T20:33:19.2721289Z op = torch.compile(op) 2025-05-07T20:33:19.2721397Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2721504Z 2025-05-07T20:33:19.2721589Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2721594Z 2025-05-07T20:33:19.2721692Z moe/activation_test.py:117: 2025-05-07T20:33:19.2721818Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2721916Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2722011Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2722374Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.2722464Z return fn(*args, **kwargs) 
2025-05-07T20:33:19.2722951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.2723043Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2723397Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2723616Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2723952Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2724048Z kernel = self.compile( 2025-05-07T20:33:19.2724424Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2724594Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2724714Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2724718Z 2025-05-07T20:33:19.2724914Z self = 2025-05-07T20:33:19.2725678Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2726210Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d9380d60>} 2025-05-07T20:33:19.2726949Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2727131Z context = 2025-05-07T20:33:19.2727136Z 2025-05-07T20:33:19.2727291Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2727556Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2727657Z module_map=module_map) 2025-05-07T20:33:19.2727817Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2727913Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2727984Z E ^ 2025-05-07T20:33:19.2728335Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2728382Z 2025-05-07T20:33:19.2728828Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2728833Z 2025-05-07T20:33:19.2728933Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2729145Z self=, 2025-05-07T20:33:19.2729219Z T=16384, 2025-05-07T20:33:19.2729295Z D=5120, 2025-05-07T20:33:19.2729374Z scale_ub=1200.0, 2025-05-07T20:33:19.2729458Z contiguous=False, 2025-05-07T20:33:19.2729536Z compiled=False, 2025-05-07T20:33:19.2729604Z ) 2025-05-07T20:33:19.2729815Z self = 2025-05-07T20:33:19.2729990Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:19.2730035Z 2025-05-07T20:33:19.2730108Z @given( 2025-05-07T20:33:19.2730225Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2730325Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2730434Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2730546Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2730652Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2730722Z ) 2025-05-07T20:33:19.2730959Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2731046Z def test_silu_mul_quant( 2025-05-07T20:33:19.2731118Z self, 2025-05-07T20:33:19.2731193Z T: int, 2025-05-07T20:33:19.2731263Z D: int, 2025-05-07T20:33:19.2731354Z scale_ub: Optional[float], 2025-05-07T20:33:19.2731441Z contiguous: bool, 2025-05-07T20:33:19.2731524Z compiled: bool, 2025-05-07T20:33:19.2731597Z ) -> None: 2025-05-07T20:33:19.2731683Z torch.manual_seed(2025) 2025-05-07T20:33:19.2731751Z 2025-05-07T20:33:19.2731918Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2731994Z 2025-05-07T20:33:19.2732082Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2732203Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2732288Z x = x_sign * x_clamp 2025-05-07T20:33:19.2732360Z x0 = x[:, :D] 2025-05-07T20:33:19.2732439Z x1 = x[:, D:] 2025-05-07T20:33:19.2732506Z 2025-05-07T20:33:19.2732585Z if contiguous: 2025-05-07T20:33:19.2732676Z x0 = x0.contiguous() 2025-05-07T20:33:19.2732760Z x1 = x1.contiguous() 2025-05-07T20:33:19.2732827Z 2025-05-07T20:33:19.2732911Z if scale_ub is not None: 2025-05-07T20:33:19.2733011Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2733143Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2733216Z ) 2025-05-07T20:33:19.2733290Z else: 2025-05-07T20:33:19.2733428Z scale_ub_tensor = None 2025-05-07T20:33:19.2733500Z 2025-05-07T20:33:19.2733631Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2733720Z op = silu_mul_quant 2025-05-07T20:33:19.2733801Z if compiled: 2025-05-07T20:33:19.2733894Z op = torch.compile(op) 2025-05-07T20:33:19.2733996Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2734060Z 2025-05-07T20:33:19.2734150Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2734155Z 2025-05-07T20:33:19.2734250Z moe/activation_test.py:117: 2025-05-07T20:33:19.2734373Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2734470Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2734565Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2735052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:19.2735149Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2735565Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2735820Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2736153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2736241Z kernel = self.compile( 2025-05-07T20:33:19.2736623Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2736790Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2736909Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2736958Z 2025-05-07T20:33:19.2737154Z self = 2025-05-07T20:33:19.2737921Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2738414Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d9381c60>} 2025-05-07T20:33:19.2739145Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2739333Z context = 2025-05-07T20:33:19.2739337Z 2025-05-07T20:33:19.2739494Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2739750Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2739858Z module_map=module_map) 2025-05-07T20:33:19.2740017Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2740350Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2740470Z E ^ 2025-05-07T20:33:19.2740856Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2740861Z 2025-05-07T20:33:19.2741278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2741284Z 2025-05-07T20:33:19.2741382Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2741598Z self=, 2025-05-07T20:33:19.2741675Z T=16384, 2025-05-07T20:33:19.2741751Z D=5120, 2025-05-07T20:33:19.2741830Z scale_ub=1200.0, 2025-05-07T20:33:19.2741909Z contiguous=True, 2025-05-07T20:33:19.2741990Z compiled=True, 2025-05-07T20:33:19.2742157Z ) 2025-05-07T20:33:19.2742375Z self = 2025-05-07T20:33:19.2742549Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:19.2742554Z 2025-05-07T20:33:19.2742630Z @given( 2025-05-07T20:33:19.2742742Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2742837Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2742950Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2743061Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2743172Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2743242Z ) 2025-05-07T20:33:19.2743479Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2743576Z def test_silu_mul_quant( 2025-05-07T20:33:19.2743646Z self, 2025-05-07T20:33:19.2743719Z T: int, 2025-05-07T20:33:19.2743805Z D: int, 2025-05-07T20:33:19.2743900Z scale_ub: Optional[float], 2025-05-07T20:33:19.2744107Z contiguous: bool, 2025-05-07T20:33:19.2744197Z compiled: bool, 2025-05-07T20:33:19.2744278Z ) -> None: 2025-05-07T20:33:19.2744370Z torch.manual_seed(2025) 2025-05-07T20:33:19.2744443Z 2025-05-07T20:33:19.2744606Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2744680Z 2025-05-07T20:33:19.2744768Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2744887Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2744975Z x = x_sign * x_clamp 2025-05-07T20:33:19.2745051Z x0 = x[:, :D] 2025-05-07T20:33:19.2745128Z x1 = x[:, D:] 2025-05-07T20:33:19.2745197Z 2025-05-07T20:33:19.2745275Z if contiguous: 2025-05-07T20:33:19.2745429Z x0 = x0.contiguous() 2025-05-07T20:33:19.2745517Z x1 = x1.contiguous() 2025-05-07T20:33:19.2745586Z 2025-05-07T20:33:19.2745677Z if scale_ub is not None: 2025-05-07T20:33:19.2745787Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2745920Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2745999Z ) 2025-05-07T20:33:19.2746074Z else: 2025-05-07T20:33:19.2746166Z scale_ub_tensor = None 2025-05-07T20:33:19.2746245Z 2025-05-07T20:33:19.2746373Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2746461Z op = silu_mul_quant 2025-05-07T20:33:19.2746547Z if compiled: 2025-05-07T20:33:19.2746642Z op = torch.compile(op) 2025-05-07T20:33:19.2746745Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2746821Z 2025-05-07T20:33:19.2746909Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2746917Z 2025-05-07T20:33:19.2747009Z moe/activation_test.py:117: 2025-05-07T20:33:19.2747135Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2747234Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2747339Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2747792Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.2747883Z return fn(*args, **kwargs) 
2025-05-07T20:33:19.2748371Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.2748464Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2748817Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2749034Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2749369Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2749460Z kernel = self.compile( 2025-05-07T20:33:19.2749906Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2750077Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2750205Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2750210Z 2025-05-07T20:33:19.2750406Z self = 2025-05-07T20:33:19.2751168Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2751655Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f38d9383380>} 2025-05-07T20:33:19.2752431Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2752654Z context = 2025-05-07T20:33:19.2752659Z 2025-05-07T20:33:19.2752816Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2753073Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2753176Z module_map=module_map) 2025-05-07T20:33:19.2753329Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2753425Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2753498Z E ^ 2025-05-07T20:33:19.2753847Z E ValueError("type fp8e4nv not supported in this architecture. 
[The test body and traceback above repeat verbatim for the next six Hypothesis draws; only the drawn parameters change, and every draw fails in triton/compiler/compiler.py:100 with the same CompilationError:]

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=None,   contiguous=False, compiled=True)  -> CompilationError: fp8e4nv not supported
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=True)  -> CompilationError: fp8e4nv not supported
Trying example: test_silu_mul_quant(T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError: fp8e4nv not supported
Trying example: test_silu_mul_quant(T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True)  -> CompilationError: fp8e4nv not supported
Trying example: test_silu_mul_quant(T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError: fp8e4nv not supported
Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=True)  -> CompilationError: fp8e4nv not supported
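For context on what the failing kernel is asked to produce, here is a plausible eager-mode reference of silu_mul_quant, inferred only from the test's usage; the rowwise-scaling choice and the silu_mul_quant_ref name are assumptions, not FBGEMM's actual implementation. torch.float8_e4m3fn is the same format Triton calls fp8e4nv, which ties this back to the compilation failure:

    from typing import Optional, Tuple

    import torch

    def silu_mul_quant_ref(
        x0: torch.Tensor,
        x1: torch.Tensor,
        scale_ub: Optional[torch.Tensor] = None,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # Assumed semantics: SiLU(x0) * x1, then rowwise float8 e4m3 quantization.
        y = torch.nn.functional.silu(x0.float()) * x1.float()
        row_max = y.abs().amax(dim=1, keepdim=True)
        if scale_ub is not None:
            # Cap the per-row maximum so one outlier cannot inflate the scale.
            row_max = torch.clamp(row_max, max=scale_ub)
        fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3
        y_scale = row_max / fp8_max  # nonzero here: the test clamps |x| >= 0.01
        y_fp8 = (y / y_scale).to(torch.float8_e4m3fn)
        return y_fp8, y_scale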
[Five more draws fail the same way. The two eager (compiled=False) draws differ only in that the torch/_dynamo/eval_frame.py frame is absent from the traceback: the Triton kernel is launched directly from fbgemm_gpu/experimental/gen_ai/moe/activation.py:80.]

Trying example: test_silu_mul_quant(T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False) -> CompilationError: fp8e4nv not supported
Trying example: test_silu_mul_quant(T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=False) -> CompilationError: fp8e4nv not supported
Trying example: test_silu_mul_quant(T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError: fp8e4nv not supported
Trying example: test_silu_mul_quant(T=128,   D=7168, scale_ub=1200.0, contiguous=False, compiled=True)  -> CompilationError: fp8e4nv not supported
Trying example: test_silu_mul_quant(T=2048,  D=7168, scale_ub=None,   contiguous=True,  compiled=True)  -> CompilationError: fp8e4nv not supported
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2897956Z 2025-05-07T20:33:19.2898370Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2898375Z 2025-05-07T20:33:19.2898473Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2898690Z self=, 2025-05-07T20:33:19.2898812Z T=16384, 2025-05-07T20:33:19.2898887Z D=5120, 2025-05-07T20:33:19.2898965Z scale_ub=None, 2025-05-07T20:33:19.2899053Z contiguous=False, 2025-05-07T20:33:19.2899141Z compiled=False, 2025-05-07T20:33:19.2899208Z ) 2025-05-07T20:33:19.2899433Z self = 2025-05-07T20:33:19.2899608Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:19.2899612Z 2025-05-07T20:33:19.2899694Z @given( 2025-05-07T20:33:19.2899809Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2899909Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2900027Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2900141Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2900251Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2900330Z ) 2025-05-07T20:33:19.2900570Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2900669Z def test_silu_mul_quant( 2025-05-07T20:33:19.2900744Z self, 2025-05-07T20:33:19.2900817Z T: int, 2025-05-07T20:33:19.2900901Z D: int, 2025-05-07T20:33:19.2901001Z scale_ub: Optional[float], 2025-05-07T20:33:19.2901087Z contiguous: bool, 2025-05-07T20:33:19.2901179Z compiled: bool, 2025-05-07T20:33:19.2901258Z ) -> None: 2025-05-07T20:33:19.2901355Z torch.manual_seed(2025) 2025-05-07T20:33:19.2901434Z 2025-05-07T20:33:19.2901596Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2901666Z 2025-05-07T20:33:19.2901763Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2901885Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2903721Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.2903734Z 2025-05-07T20:33:19.2903849Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:19.2903854Z 2025-05-07T20:33:19.2903957Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2904175Z self=, 2025-05-07T20:33:19.2904248Z T=4096, 2025-05-07T20:33:19.2904327Z D=7168, 2025-05-07T20:33:19.2904405Z scale_ub=1200.0, 2025-05-07T20:33:19.2904483Z contiguous=True, 2025-05-07T20:33:19.2904564Z compiled=True, 2025-05-07T20:33:19.2904631Z ) 2025-05-07T20:33:19.2904842Z self = 2025-05-07T20:33:19.2905010Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:19.2905014Z 2025-05-07T20:33:19.2905088Z @given( 2025-05-07T20:33:19.2905287Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2905383Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2905492Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2905606Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2905714Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2905785Z ) 2025-05-07T20:33:19.2906028Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2906118Z def test_silu_mul_quant( 2025-05-07T20:33:19.2906190Z self, 2025-05-07T20:33:19.2906272Z T: int, 2025-05-07T20:33:19.2906344Z D: int, 2025-05-07T20:33:19.2906436Z scale_ub: Optional[float], 2025-05-07T20:33:19.2906642Z contiguous: bool, 2025-05-07T20:33:19.2906725Z compiled: bool, 2025-05-07T20:33:19.2906806Z ) -> None: 2025-05-07T20:33:19.2906899Z torch.manual_seed(2025) 2025-05-07T20:33:19.2906971Z 2025-05-07T20:33:19.2907145Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2907214Z 2025-05-07T20:33:19.2907301Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2907473Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2909233Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.2909242Z 2025-05-07T20:33:19.2909360Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:19.2909365Z 2025-05-07T20:33:19.2909466Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2909680Z self=, 2025-05-07T20:33:19.2909756Z T=16384, 2025-05-07T20:33:19.2909823Z D=7168, 2025-05-07T20:33:19.2909899Z scale_ub=None, 2025-05-07T20:33:19.2909980Z contiguous=False, 2025-05-07T20:33:19.2910059Z compiled=False, 2025-05-07T20:33:19.2910126Z ) 2025-05-07T20:33:19.2910335Z self = 2025-05-07T20:33:19.2910503Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:19.2910508Z 2025-05-07T20:33:19.2910582Z @given( 2025-05-07T20:33:19.2910694Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2910792Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2910907Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2911065Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2911182Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2911254Z ) 2025-05-07T20:33:19.2911491Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2911584Z def test_silu_mul_quant( 2025-05-07T20:33:19.2911656Z self, 2025-05-07T20:33:19.2911730Z T: int, 2025-05-07T20:33:19.2911807Z D: int, 2025-05-07T20:33:19.2911899Z scale_ub: Optional[float], 2025-05-07T20:33:19.2911986Z contiguous: bool, 2025-05-07T20:33:19.2912075Z compiled: bool, 2025-05-07T20:33:19.2912149Z ) -> None: 2025-05-07T20:33:19.2912241Z torch.manual_seed(2025) 2025-05-07T20:33:19.2912315Z 2025-05-07T20:33:19.2912475Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2914287Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.2914331Z 2025-05-07T20:33:19.2914448Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.2914452Z 2025-05-07T20:33:19.2914551Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2914768Z self=, 2025-05-07T20:33:19.2914844Z T=2048, 2025-05-07T20:33:19.2914983Z D=7168, 2025-05-07T20:33:19.2915065Z scale_ub=1200.0, 2025-05-07T20:33:19.2915147Z contiguous=True, 2025-05-07T20:33:19.2915234Z compiled=True, 2025-05-07T20:33:19.2915312Z ) 2025-05-07T20:33:19.2915526Z self = 2025-05-07T20:33:19.2915696Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:19.2915700Z 2025-05-07T20:33:19.2915772Z @given( 2025-05-07T20:33:19.2915890Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2915999Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2916120Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2916262Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2916369Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2916443Z ) 2025-05-07T20:33:19.2916682Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2916773Z def test_silu_mul_quant( 2025-05-07T20:33:19.2916846Z self, 2025-05-07T20:33:19.2916919Z T: int, 2025-05-07T20:33:19.2916990Z D: int, 2025-05-07T20:33:19.2917089Z scale_ub: Optional[float], 2025-05-07T20:33:19.2917179Z contiguous: bool, 2025-05-07T20:33:19.2917259Z compiled: bool, 2025-05-07T20:33:19.2917335Z ) -> None: 2025-05-07T20:33:19.2917428Z torch.manual_seed(2025) 2025-05-07T20:33:19.2917496Z 2025-05-07T20:33:19.2917665Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2917735Z 2025-05-07T20:33:19.2917823Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2917945Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2919733Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.2919744Z 2025-05-07T20:33:19.2919864Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:19.2919869Z 2025-05-07T20:33:19.2919965Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2920182Z self=, 2025-05-07T20:33:19.2920254Z T=2048, 2025-05-07T20:33:19.2920328Z D=7168, 2025-05-07T20:33:19.2920410Z scale_ub=None, 2025-05-07T20:33:19.2920489Z contiguous=True, 2025-05-07T20:33:19.2920570Z compiled=False, 2025-05-07T20:33:19.2920639Z ) 2025-05-07T20:33:19.2920848Z self = 2025-05-07T20:33:19.2921014Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.2921018Z 2025-05-07T20:33:19.2921096Z @given( 2025-05-07T20:33:19.2921210Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2921385Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2921493Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2921605Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2921716Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2921788Z ) 2025-05-07T20:33:19.2922028Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2922121Z def test_silu_mul_quant( 2025-05-07T20:33:19.2922192Z self, 2025-05-07T20:33:19.2922266Z T: int, 2025-05-07T20:33:19.2922338Z D: int, 2025-05-07T20:33:19.2922429Z scale_ub: Optional[float], 2025-05-07T20:33:19.2922513Z contiguous: bool, 2025-05-07T20:33:19.2922641Z compiled: bool, 2025-05-07T20:33:19.2922717Z ) -> None: 2025-05-07T20:33:19.2922809Z torch.manual_seed(2025) 2025-05-07T20:33:19.2922879Z 2025-05-07T20:33:19.2923043Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2923119Z 2025-05-07T20:33:19.2923204Z > x_sign = torch.sign(x) 2025-05-07T20:33:19.2924942Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.2924953Z 2025-05-07T20:33:19.2925063Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:19.2925068Z 2025-05-07T20:33:19.2925164Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2925385Z self=, 2025-05-07T20:33:19.2925460Z T=1, 2025-05-07T20:33:19.2925533Z D=7168, 2025-05-07T20:33:19.2925613Z scale_ub=1200.0, 2025-05-07T20:33:19.2925693Z contiguous=True, 2025-05-07T20:33:19.2925777Z compiled=False, 2025-05-07T20:33:19.2925848Z ) 2025-05-07T20:33:19.2926058Z self = 2025-05-07T20:33:19.2926219Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.2926223Z 2025-05-07T20:33:19.2926294Z @given( 2025-05-07T20:33:19.2926404Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2926501Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2926608Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2926723Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2926882Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2926956Z ) 2025-05-07T20:33:19.2927205Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2927294Z def test_silu_mul_quant( 2025-05-07T20:33:19.2927368Z self, 2025-05-07T20:33:19.2927448Z T: int, 2025-05-07T20:33:19.2927521Z D: int, 2025-05-07T20:33:19.2927613Z scale_ub: Optional[float], 2025-05-07T20:33:19.2927702Z contiguous: bool, 2025-05-07T20:33:19.2927781Z compiled: bool, 2025-05-07T20:33:19.2927855Z ) -> None: 2025-05-07T20:33:19.2927950Z torch.manual_seed(2025) 2025-05-07T20:33:19.2928020Z 2025-05-07T20:33:19.2928182Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2928260Z 2025-05-07T20:33:19.2928347Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2928475Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2928560Z x = x_sign * x_clamp 2025-05-07T20:33:19.2928638Z x0 = x[:, :D] 2025-05-07T20:33:19.2928716Z x1 = x[:, D:] 2025-05-07T20:33:19.2928866Z 2025-05-07T20:33:19.2928946Z if contiguous: 2025-05-07T20:33:19.2929035Z x0 = x0.contiguous() 2025-05-07T20:33:19.2929119Z x1 = x1.contiguous() 2025-05-07T20:33:19.2929191Z 2025-05-07T20:33:19.2929281Z if scale_ub is not None: 2025-05-07T20:33:19.2929379Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2929508Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2929585Z ) 2025-05-07T20:33:19.2929657Z else: 2025-05-07T20:33:19.2929747Z scale_ub_tensor = None 2025-05-07T20:33:19.2929818Z 2025-05-07T20:33:19.2929941Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2930072Z op = silu_mul_quant 2025-05-07T20:33:19.2930153Z if compiled: 2025-05-07T20:33:19.2930249Z op = torch.compile(op) 2025-05-07T20:33:19.2930354Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2930424Z 2025-05-07T20:33:19.2930512Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2930516Z 2025-05-07T20:33:19.2930612Z moe/activation_test.py:117: 2025-05-07T20:33:19.2930735Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2930828Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2930922Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2931415Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.2931510Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2931865Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2932084Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2932429Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2932520Z kernel = self.compile( 2025-05-07T20:33:19.2932905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2933074Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2933198Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2933202Z 2025-05-07T20:33:19.2933405Z self = 2025-05-07T20:33:19.2934169Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2934714Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f359bc162a0>} 2025-05-07T20:33:19.2935451Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2935641Z context = 2025-05-07T20:33:19.2935645Z 2025-05-07T20:33:19.2935809Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2936067Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2936178Z module_map=module_map) 2025-05-07T20:33:19.2936334Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2936433Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2936516Z E ^ 2025-05-07T20:33:19.2936865Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2936911Z 2025-05-07T20:33:19.2937366Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2937377Z 2025-05-07T20:33:19.2937478Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2937694Z self=, 2025-05-07T20:33:19.2937776Z T=128, 2025-05-07T20:33:19.2937852Z D=5120, 2025-05-07T20:33:19.2937930Z scale_ub=None, 2025-05-07T20:33:19.2938012Z contiguous=True, 2025-05-07T20:33:19.2938091Z compiled=False, 2025-05-07T20:33:19.2938157Z ) 2025-05-07T20:33:19.2938374Z self = 2025-05-07T20:33:19.2938583Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.2938588Z 2025-05-07T20:33:19.2938664Z @given( 2025-05-07T20:33:19.2938779Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2938878Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2938990Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2939102Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2939213Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2939288Z ) 2025-05-07T20:33:19.2939526Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2939615Z def test_silu_mul_quant( 2025-05-07T20:33:19.2939696Z self, 2025-05-07T20:33:19.2939771Z T: int, 2025-05-07T20:33:19.2939847Z D: int, 2025-05-07T20:33:19.2939943Z scale_ub: Optional[float], 2025-05-07T20:33:19.2940026Z contiguous: bool, 2025-05-07T20:33:19.2940489Z compiled: bool, 2025-05-07T20:33:19.2940601Z ) -> None: 2025-05-07T20:33:19.2940724Z torch.manual_seed(2025) 2025-05-07T20:33:19.2940825Z 2025-05-07T20:33:19.2941027Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2941106Z 2025-05-07T20:33:19.2941200Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2944698Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2944801Z x = x_sign * x_clamp 2025-05-07T20:33:19.2944882Z x0 = x[:, :D] 2025-05-07T20:33:19.2944961Z x1 = x[:, D:] 2025-05-07T20:33:19.2945030Z 2025-05-07T20:33:19.2945113Z if contiguous: 2025-05-07T20:33:19.2945202Z x0 = x0.contiguous() 2025-05-07T20:33:19.2945290Z x1 = x1.contiguous() 2025-05-07T20:33:19.2945363Z 2025-05-07T20:33:19.2945452Z if scale_ub is not None: 2025-05-07T20:33:19.2945559Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2945691Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2945773Z ) 2025-05-07T20:33:19.2945850Z else: 2025-05-07T20:33:19.2946077Z scale_ub_tensor = None 2025-05-07T20:33:19.2946157Z 2025-05-07T20:33:19.2946307Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2946396Z op = silu_mul_quant 2025-05-07T20:33:19.2946478Z if compiled: 2025-05-07T20:33:19.2946581Z op = torch.compile(op) 2025-05-07T20:33:19.2946687Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2946761Z 2025-05-07T20:33:19.2946849Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2946854Z 2025-05-07T20:33:19.2946948Z moe/activation_test.py:117: 2025-05-07T20:33:19.2947076Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2947174Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2947269Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2947855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.2947954Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2948381Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2948678Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2949015Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2949111Z kernel = self.compile( 2025-05-07T20:33:19.2949490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2949660Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2949787Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2949853Z 2025-05-07T20:33:19.2950057Z self = 2025-05-07T20:33:19.2950825Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2951319Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f359bc171a0>} 2025-05-07T20:33:19.2952053Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2952237Z context = 2025-05-07T20:33:19.2952242Z 2025-05-07T20:33:19.2952401Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2952666Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2952774Z module_map=module_map) 2025-05-07T20:33:19.2952936Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2953036Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2953110Z E ^ 2025-05-07T20:33:19.2953462Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2953467Z 2025-05-07T20:33:19.2953883Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2953887Z 2025-05-07T20:33:19.2953986Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2954207Z self=, 2025-05-07T20:33:19.2954285Z T=128, 2025-05-07T20:33:19.2954364Z D=7168, 2025-05-07T20:33:19.2954444Z scale_ub=None, 2025-05-07T20:33:19.2954525Z contiguous=True, 2025-05-07T20:33:19.2954608Z compiled=False, 2025-05-07T20:33:19.2954724Z ) 2025-05-07T20:33:19.2954941Z self = 2025-05-07T20:33:19.2955112Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.2955117Z 2025-05-07T20:33:19.2955191Z @given( 2025-05-07T20:33:19.2955304Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2955405Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2955514Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2955630Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2955740Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2955813Z ) 2025-05-07T20:33:19.2956055Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2956148Z def test_silu_mul_quant( 2025-05-07T20:33:19.2956223Z self, 2025-05-07T20:33:19.2956306Z T: int, 2025-05-07T20:33:19.2956383Z D: int, 2025-05-07T20:33:19.2956476Z scale_ub: Optional[float], 2025-05-07T20:33:19.2956646Z contiguous: bool, 2025-05-07T20:33:19.2956729Z compiled: bool, 2025-05-07T20:33:19.2956807Z ) -> None: 2025-05-07T20:33:19.2956901Z torch.manual_seed(2025) 2025-05-07T20:33:19.2956973Z 2025-05-07T20:33:19.2957134Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2957212Z 2025-05-07T20:33:19.2957300Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2957422Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2957507Z x = x_sign * x_clamp 2025-05-07T20:33:19.2957581Z x0 = x[:, :D] 2025-05-07T20:33:19.2957657Z x1 = x[:, D:] 2025-05-07T20:33:19.2957726Z 2025-05-07T20:33:19.2957806Z if contiguous: 2025-05-07T20:33:19.2957938Z x0 = x0.contiguous() 2025-05-07T20:33:19.2958022Z x1 = x1.contiguous() 2025-05-07T20:33:19.2958087Z 2025-05-07T20:33:19.2958179Z if scale_ub is not None: 2025-05-07T20:33:19.2958287Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2958419Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2958495Z ) 2025-05-07T20:33:19.2958569Z else: 2025-05-07T20:33:19.2958660Z scale_ub_tensor = None 2025-05-07T20:33:19.2958729Z 2025-05-07T20:33:19.2958852Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2958937Z op = silu_mul_quant 2025-05-07T20:33:19.2959017Z if compiled: 2025-05-07T20:33:19.2959112Z op = torch.compile(op) 2025-05-07T20:33:19.2959215Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2959284Z 2025-05-07T20:33:19.2959373Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2959381Z 2025-05-07T20:33:19.2959479Z moe/activation_test.py:117: 2025-05-07T20:33:19.2959604Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2959708Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2959808Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2960295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.2960397Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2960750Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2960967Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2961308Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2961400Z kernel = self.compile( 2025-05-07T20:33:19.2961787Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2962001Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2962129Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2962133Z 2025-05-07T20:33:19.2962337Z self = 2025-05-07T20:33:19.2963102Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2963593Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f359be10040>} 2025-05-07T20:33:19.2964324Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2964515Z context = 2025-05-07T20:33:19.2964604Z 2025-05-07T20:33:19.2964763Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2965020Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2965128Z module_map=module_map) 2025-05-07T20:33:19.2965284Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2965379Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2965459Z E ^ 2025-05-07T20:33:19.2965806Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2965811Z 2025-05-07T20:33:19.2966228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2966271Z 2025-05-07T20:33:19.2966369Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2966590Z self=, 2025-05-07T20:33:19.2966671Z T=2048, 2025-05-07T20:33:19.2966743Z D=7168, 2025-05-07T20:33:19.2966822Z scale_ub=1200.0, 2025-05-07T20:33:19.2966906Z contiguous=True, 2025-05-07T20:33:19.2966984Z compiled=False, 2025-05-07T20:33:19.2967053Z ) 2025-05-07T20:33:19.2967267Z self = 2025-05-07T20:33:19.2967435Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.2967440Z 2025-05-07T20:33:19.2967515Z @given( 2025-05-07T20:33:19.2967626Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2967722Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2967837Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2967952Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2968064Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2968138Z ) 2025-05-07T20:33:19.2968379Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2968468Z def test_silu_mul_quant( 2025-05-07T20:33:19.2968545Z self, 2025-05-07T20:33:19.2968619Z T: int, 2025-05-07T20:33:19.2968699Z D: int, 2025-05-07T20:33:19.2968793Z scale_ub: Optional[float], 2025-05-07T20:33:19.2968876Z contiguous: bool, 2025-05-07T20:33:19.2968959Z compiled: bool, 2025-05-07T20:33:19.2969031Z ) -> None: 2025-05-07T20:33:19.2969122Z torch.manual_seed(2025) 2025-05-07T20:33:19.2969195Z 2025-05-07T20:33:19.2969356Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2971167Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.2971178Z 2025-05-07T20:33:19.2971290Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.2971295Z 2025-05-07T20:33:19.2971389Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2971606Z self=, 2025-05-07T20:33:19.2971679Z T=1, 2025-05-07T20:33:19.2971753Z D=5120, 2025-05-07T20:33:19.2971829Z scale_ub=1200.0, 2025-05-07T20:33:19.2971905Z contiguous=True, 2025-05-07T20:33:19.2971987Z compiled=False, 2025-05-07T20:33:19.2972056Z ) 2025-05-07T20:33:19.2972267Z self = 2025-05-07T20:33:19.2972469Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.2972509Z 2025-05-07T20:33:19.2972583Z @given( 2025-05-07T20:33:19.2972693Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2972788Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2972895Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2973009Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2973122Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2973194Z ) 2025-05-07T20:33:19.2973434Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2973521Z def test_silu_mul_quant( 2025-05-07T20:33:19.2973592Z self, 2025-05-07T20:33:19.2973709Z T: int, 2025-05-07T20:33:19.2973781Z D: int, 2025-05-07T20:33:19.2973874Z scale_ub: Optional[float], 2025-05-07T20:33:19.2973961Z contiguous: bool, 2025-05-07T20:33:19.2974042Z compiled: bool, 2025-05-07T20:33:19.2974119Z ) -> None: 2025-05-07T20:33:19.2974218Z torch.manual_seed(2025) 2025-05-07T20:33:19.2974288Z 2025-05-07T20:33:19.2974457Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2974528Z 2025-05-07T20:33:19.2974619Z x_sign = torch.sign(x) 2025-05-07T20:33:19.2974742Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.2974829Z x = x_sign * x_clamp 2025-05-07T20:33:19.2974908Z x0 = x[:, :D] 2025-05-07T20:33:19.2974990Z x1 = x[:, D:] 2025-05-07T20:33:19.2975057Z 2025-05-07T20:33:19.2975136Z if contiguous: 2025-05-07T20:33:19.2975230Z x0 = x0.contiguous() 2025-05-07T20:33:19.2975317Z x1 = x1.contiguous() 2025-05-07T20:33:19.2975389Z 2025-05-07T20:33:19.2975478Z if scale_ub is not None: 2025-05-07T20:33:19.2975580Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.2975712Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.2975790Z ) 2025-05-07T20:33:19.2975864Z else: 2025-05-07T20:33:19.2975957Z scale_ub_tensor = None 2025-05-07T20:33:19.2976026Z 2025-05-07T20:33:19.2976172Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.2976267Z op = silu_mul_quant 2025-05-07T20:33:19.2976369Z if compiled: 2025-05-07T20:33:19.2976466Z op = torch.compile(op) 2025-05-07T20:33:19.2976570Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2976636Z 2025-05-07T20:33:19.2976724Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.2976729Z 2025-05-07T20:33:19.2976824Z moe/activation_test.py:117: 2025-05-07T20:33:19.2976947Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2977047Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.2977140Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.2977675Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.2977774Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.2978127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.2978343Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.2978683Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.2978773Z kernel = self.compile( 2025-05-07T20:33:19.2979175Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.2979346Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.2979469Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.2979474Z 2025-05-07T20:33:19.2979773Z self = 2025-05-07T20:33:19.2980536Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.2981030Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f359be11580>} 2025-05-07T20:33:19.2981756Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.2981981Z context = 2025-05-07T20:33:19.2981990Z 2025-05-07T20:33:19.2982149Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.2982409Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.2982515Z module_map=module_map) 2025-05-07T20:33:19.2982671Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.2982764Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.2982840Z E ^ 2025-05-07T20:33:19.2983186Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.2983191Z 2025-05-07T20:33:19.2983602Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.2983606Z 2025-05-07T20:33:19.2983707Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2983921Z self=, 2025-05-07T20:33:19.2984005Z T=2048, 2025-05-07T20:33:19.2984078Z D=5120, 2025-05-07T20:33:19.2984161Z scale_ub=None, 2025-05-07T20:33:19.2984245Z contiguous=True, 2025-05-07T20:33:19.2984326Z compiled=False, 2025-05-07T20:33:19.2984395Z ) 2025-05-07T20:33:19.2984612Z self = 2025-05-07T20:33:19.2984779Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.2984783Z 2025-05-07T20:33:19.2984861Z @given( 2025-05-07T20:33:19.2984974Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2985068Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2985178Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2985289Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2985399Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2985474Z ) 2025-05-07T20:33:19.2985755Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2985850Z def test_silu_mul_quant( 2025-05-07T20:33:19.2985929Z self, 2025-05-07T20:33:19.2986003Z T: int, 2025-05-07T20:33:19.2986079Z D: int, 2025-05-07T20:33:19.2986171Z scale_ub: Optional[float], 2025-05-07T20:33:19.2986254Z contiguous: bool, 2025-05-07T20:33:19.2986335Z compiled: bool, 2025-05-07T20:33:19.2986409Z ) -> None: 2025-05-07T20:33:19.2986495Z torch.manual_seed(2025) 2025-05-07T20:33:19.2986567Z 2025-05-07T20:33:19.2986727Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2986796Z 2025-05-07T20:33:19.2986883Z > x_sign = torch.sign(x) 2025-05-07T20:33:19.2988730Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.2988773Z 2025-05-07T20:33:19.2988890Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:19.2988895Z 2025-05-07T20:33:19.2988995Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2989215Z self=, 2025-05-07T20:33:19.2989288Z T=16384, 2025-05-07T20:33:19.2989362Z D=5120, 2025-05-07T20:33:19.2989441Z scale_ub=None, 2025-05-07T20:33:19.2989522Z contiguous=True, 2025-05-07T20:33:19.2989602Z compiled=False, 2025-05-07T20:33:19.2989719Z ) 2025-05-07T20:33:19.2989932Z self = 2025-05-07T20:33:19.2990108Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.2990114Z 2025-05-07T20:33:19.2990193Z @given( 2025-05-07T20:33:19.2990306Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2990400Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2990509Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2990618Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2990730Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2990801Z ) 2025-05-07T20:33:19.2991041Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2991133Z def test_silu_mul_quant( 2025-05-07T20:33:19.2991206Z self, 2025-05-07T20:33:19.2991281Z T: int, 2025-05-07T20:33:19.2991360Z D: int, 2025-05-07T20:33:19.2991452Z scale_ub: Optional[float], 2025-05-07T20:33:19.2991535Z contiguous: bool, 2025-05-07T20:33:19.2991623Z compiled: bool, 2025-05-07T20:33:19.2991697Z ) -> None: 2025-05-07T20:33:19.2991793Z torch.manual_seed(2025) 2025-05-07T20:33:19.2991860Z 2025-05-07T20:33:19.2992020Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2993771Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.2993779Z 2025-05-07T20:33:19.2993890Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.2993895Z 2025-05-07T20:33:19.2994038Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2994259Z self=, 2025-05-07T20:33:19.2994332Z T=4096, 2025-05-07T20:33:19.2994408Z D=5120, 2025-05-07T20:33:19.2994484Z scale_ub=None, 2025-05-07T20:33:19.2994564Z contiguous=True, 2025-05-07T20:33:19.2994644Z compiled=False, 2025-05-07T20:33:19.2994713Z ) 2025-05-07T20:33:19.2994925Z self = 2025-05-07T20:33:19.2995087Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.2995092Z 2025-05-07T20:33:19.2995161Z @given( 2025-05-07T20:33:19.2995276Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.2995370Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.2995480Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.2995593Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.2995701Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.2995813Z ) 2025-05-07T20:33:19.2996091Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.2996180Z def test_silu_mul_quant( 2025-05-07T20:33:19.2996254Z self, 2025-05-07T20:33:19.2996325Z T: int, 2025-05-07T20:33:19.2996395Z D: int, 2025-05-07T20:33:19.2996490Z scale_ub: Optional[float], 2025-05-07T20:33:19.2996574Z contiguous: bool, 2025-05-07T20:33:19.2996654Z compiled: bool, 2025-05-07T20:33:19.2996730Z ) -> None: 2025-05-07T20:33:19.2996822Z torch.manual_seed(2025) 2025-05-07T20:33:19.2996891Z 2025-05-07T20:33:19.2997054Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.2998840Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.2998848Z 2025-05-07T20:33:19.2998961Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.2998966Z 2025-05-07T20:33:19.2999063Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.2999280Z self=, 2025-05-07T20:33:19.2999355Z T=2048, 2025-05-07T20:33:19.2999429Z D=5120, 2025-05-07T20:33:19.2999509Z scale_ub=None, 2025-05-07T20:33:19.2999591Z contiguous=False, 2025-05-07T20:33:19.2999669Z compiled=False, 2025-05-07T20:33:19.2999742Z ) 2025-05-07T20:33:19.2999954Z self = 2025-05-07T20:33:19.3000123Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:19.3000128Z 2025-05-07T20:33:19.3000201Z @given( 2025-05-07T20:33:19.3000314Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.3000409Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.3000517Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.3000627Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.3000737Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.3000808Z ) 2025-05-07T20:33:19.3001044Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.3001134Z def test_silu_mul_quant( 2025-05-07T20:33:19.3001210Z self, 2025-05-07T20:33:19.3001281Z T: int, 2025-05-07T20:33:19.3001355Z D: int, 2025-05-07T20:33:19.3001447Z scale_ub: Optional[float], 2025-05-07T20:33:19.3001995Z contiguous: bool, 2025-05-07T20:33:19.3002083Z compiled: bool, 2025-05-07T20:33:19.3002161Z ) -> None: 2025-05-07T20:33:19.3002251Z torch.manual_seed(2025) 2025-05-07T20:33:19.3002318Z 2025-05-07T20:33:19.3002478Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.3004213Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.3004222Z 2025-05-07T20:33:19.3004336Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.3004341Z 2025-05-07T20:33:19.3004523Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.3004737Z self=, 2025-05-07T20:33:19.3004807Z T=4096, 2025-05-07T20:33:19.3004882Z D=7168, 2025-05-07T20:33:19.3004959Z scale_ub=None, 2025-05-07T20:33:19.3005038Z contiguous=True, 2025-05-07T20:33:19.3005124Z compiled=True, 2025-05-07T20:33:19.3005195Z ) 2025-05-07T20:33:19.3005410Z self = 2025-05-07T20:33:19.3005570Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:19.3005575Z 2025-05-07T20:33:19.3005645Z @given( 2025-05-07T20:33:19.3005760Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.3005898Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.3006008Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.3006131Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.3006244Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.3006315Z ) 2025-05-07T20:33:19.3006556Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.3006646Z def test_silu_mul_quant( 2025-05-07T20:33:19.3006728Z self, 2025-05-07T20:33:19.3006802Z T: int, 2025-05-07T20:33:19.3006875Z D: int, 2025-05-07T20:33:19.3006975Z scale_ub: Optional[float], 2025-05-07T20:33:19.3007061Z contiguous: bool, 2025-05-07T20:33:19.3007142Z compiled: bool, 2025-05-07T20:33:19.3007224Z ) -> None: 2025-05-07T20:33:19.3007317Z torch.manual_seed(2025) 2025-05-07T20:33:19.3007389Z 2025-05-07T20:33:19.3007551Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.3009302Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.3009311Z 2025-05-07T20:33:19.3009429Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.3009433Z 2025-05-07T20:33:19.3009531Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.3009751Z self=, 2025-05-07T20:33:19.3009825Z T=2048, 2025-05-07T20:33:19.3009905Z D=5120, 2025-05-07T20:33:19.3009987Z scale_ub=1200.0, 2025-05-07T20:33:19.3010069Z contiguous=False, 2025-05-07T20:33:19.3010149Z compiled=False, 2025-05-07T20:33:19.3010291Z ) 2025-05-07T20:33:19.3010507Z self = 2025-05-07T20:33:19.3010679Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:19.3010684Z 2025-05-07T20:33:19.3010763Z @given( 2025-05-07T20:33:19.3010878Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.3010977Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.3011088Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.3011200Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.3011314Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.3011387Z ) 2025-05-07T20:33:19.3011623Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.3011717Z def test_silu_mul_quant( 2025-05-07T20:33:19.3011791Z self, 2025-05-07T20:33:19.3011866Z T: int, 2025-05-07T20:33:19.3011942Z D: int, 2025-05-07T20:33:19.3012040Z scale_ub: Optional[float], 2025-05-07T20:33:19.3012204Z contiguous: bool, 2025-05-07T20:33:19.3012292Z compiled: bool, 2025-05-07T20:33:19.3012367Z ) -> None: 2025-05-07T20:33:19.3012459Z torch.manual_seed(2025) 2025-05-07T20:33:19.3012528Z 2025-05-07T20:33:19.3012691Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.3014436Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.3014483Z 2025-05-07T20:33:19.3014598Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.3014605Z 2025-05-07T20:33:19.3014705Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.3014919Z self=, 2025-05-07T20:33:19.3014993Z T=4096, 2025-05-07T20:33:19.3015071Z D=7168, 2025-05-07T20:33:19.3015149Z scale_ub=1200.0, 2025-05-07T20:33:19.3015229Z contiguous=True, 2025-05-07T20:33:19.3015314Z compiled=False, 2025-05-07T20:33:19.3015384Z ) 2025-05-07T20:33:19.3015599Z self = 2025-05-07T20:33:19.3015764Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.3015768Z 2025-05-07T20:33:19.3015841Z @given( 2025-05-07T20:33:19.3015961Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.3016058Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.3016169Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.3016293Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.3016403Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.3016475Z ) 2025-05-07T20:33:19.3016714Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.3016804Z def test_silu_mul_quant( 2025-05-07T20:33:19.3016882Z self, 2025-05-07T20:33:19.3016954Z T: int, 2025-05-07T20:33:19.3017025Z D: int, 2025-05-07T20:33:19.3017121Z scale_ub: Optional[float], 2025-05-07T20:33:19.3017205Z contiguous: bool, 2025-05-07T20:33:19.3017288Z compiled: bool, 2025-05-07T20:33:19.3017365Z ) -> None: 2025-05-07T20:33:19.3017455Z torch.manual_seed(2025) 2025-05-07T20:33:19.3017526Z 2025-05-07T20:33:19.3017687Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.3019472Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.3019481Z 2025-05-07T20:33:19.3019595Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.3019599Z 2025-05-07T20:33:19.3019696Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.3019916Z self=, 2025-05-07T20:33:19.3019994Z T=16384, 2025-05-07T20:33:19.3020070Z D=7168, 2025-05-07T20:33:19.3020150Z scale_ub=None, 2025-05-07T20:33:19.3020235Z contiguous=False, 2025-05-07T20:33:19.3020318Z compiled=True, 2025-05-07T20:33:19.3020428Z ) 2025-05-07T20:33:19.3020674Z self = 2025-05-07T20:33:19.3020845Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:19.3020850Z 2025-05-07T20:33:19.3020926Z @given( 2025-05-07T20:33:19.3021037Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.3021133Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.3021242Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.3021352Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.3021463Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.3021533Z ) 2025-05-07T20:33:19.3021769Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.3021905Z def test_silu_mul_quant( 2025-05-07T20:33:19.3021980Z self, 2025-05-07T20:33:19.3022054Z T: int, 2025-05-07T20:33:19.3022133Z D: int, 2025-05-07T20:33:19.3022226Z scale_ub: Optional[float], 2025-05-07T20:33:19.3022310Z contiguous: bool, 2025-05-07T20:33:19.3022397Z compiled: bool, 2025-05-07T20:33:19.3022471Z ) -> None: 2025-05-07T20:33:19.3022566Z torch.manual_seed(2025) 2025-05-07T20:33:19.3022633Z 2025-05-07T20:33:19.3022794Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.3024537Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.3024550Z 2025-05-07T20:33:19.3024661Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.3024665Z 2025-05-07T20:33:19.3024768Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.3024982Z self=, 2025-05-07T20:33:19.3025056Z T=4096, 2025-05-07T20:33:19.3025135Z D=7168, 2025-05-07T20:33:19.3025213Z scale_ub=None, 2025-05-07T20:33:19.3025294Z contiguous=True, 2025-05-07T20:33:19.3025379Z compiled=False, 2025-05-07T20:33:19.3025448Z ) 2025-05-07T20:33:19.3025659Z self = 2025-05-07T20:33:19.3025826Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.3025833Z 2025-05-07T20:33:19.3025905Z @given( 2025-05-07T20:33:19.3026021Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.3026158Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.3026272Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.3026387Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.3026494Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.3026566Z ) 2025-05-07T20:33:19.3026807Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.3026897Z def test_silu_mul_quant( 2025-05-07T20:33:19.3026973Z self, 2025-05-07T20:33:19.3027049Z T: int, 2025-05-07T20:33:19.3027124Z D: int, 2025-05-07T20:33:19.3027222Z scale_ub: Optional[float], 2025-05-07T20:33:19.3027309Z contiguous: bool, 2025-05-07T20:33:19.3027389Z compiled: bool, 2025-05-07T20:33:19.3027512Z ) -> None: 2025-05-07T20:33:19.3027604Z torch.manual_seed(2025) 2025-05-07T20:33:19.3027675Z 2025-05-07T20:33:19.3027843Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.3029623Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.3029666Z 2025-05-07T20:33:19.3029782Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.3029786Z 2025-05-07T20:33:19.3029881Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.3030139Z self=, 2025-05-07T20:33:19.3030214Z T=16384, 2025-05-07T20:33:19.3030285Z D=7168, 2025-05-07T20:33:19.3030371Z scale_ub=None, 2025-05-07T20:33:19.3030456Z contiguous=True, 2025-05-07T20:33:19.3030538Z compiled=False, 2025-05-07T20:33:19.3030612Z ) 2025-05-07T20:33:19.3030820Z self = 2025-05-07T20:33:19.3030988Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:19.3030992Z 2025-05-07T20:33:19.3031066Z @given( 2025-05-07T20:33:19.3031178Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.3031275Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.3031384Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.3031496Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.3031608Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.3031680Z ) 2025-05-07T20:33:19.3031918Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.3032016Z def test_silu_mul_quant( 2025-05-07T20:33:19.3032091Z self, 2025-05-07T20:33:19.3032169Z T: int, 2025-05-07T20:33:19.3032247Z D: int, 2025-05-07T20:33:19.3032340Z scale_ub: Optional[float], 2025-05-07T20:33:19.3032424Z contiguous: bool, 2025-05-07T20:33:19.3032510Z compiled: bool, 2025-05-07T20:33:19.3032583Z ) -> None: 2025-05-07T20:33:19.3032673Z torch.manual_seed(2025) 2025-05-07T20:33:19.3032742Z 2025-05-07T20:33:19.3032901Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.3034690Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.3034701Z 2025-05-07T20:33:19.3034813Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.3034818Z 2025-05-07T20:33:19.3034924Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.3035141Z self=, 2025-05-07T20:33:19.3035213Z T=16384, 2025-05-07T20:33:19.3035293Z D=7168, 2025-05-07T20:33:19.3035370Z scale_ub=1200.0, 2025-05-07T20:33:19.3035453Z contiguous=True, 2025-05-07T20:33:19.3035537Z compiled=False, 2025-05-07T20:33:19.3035606Z ) 2025-05-07T20:33:19.3035819Z self = 2025-05-07T20:33:19.3036007Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.3036013Z 2025-05-07T20:33:19.3036089Z @given( 2025-05-07T20:33:19.3036233Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.3036482Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.3036592Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.3036710Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.3036819Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.3036890Z ) 2025-05-07T20:33:19.3037132Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.3037222Z def test_silu_mul_quant( 2025-05-07T20:33:19.3037299Z self, 2025-05-07T20:33:19.3037373Z T: int, 2025-05-07T20:33:19.3037449Z D: int, 2025-05-07T20:33:19.3037547Z scale_ub: Optional[float], 2025-05-07T20:33:19.3037630Z contiguous: bool, 2025-05-07T20:33:19.3037778Z compiled: bool, 2025-05-07T20:33:19.3037858Z ) -> None: 2025-05-07T20:33:19.3037948Z torch.manual_seed(2025) 2025-05-07T20:33:19.3038020Z 2025-05-07T20:33:19.3038193Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.3039933Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
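Note: the "Tried to allocate" sizes track the test's input shape exactly. x is a [T, 2 * D] bfloat16 tensor, so the T=16384, D=7168 examples above ask for 448.00 MiB and the T=4096 ones for 112.00 MiB. A standalone sketch checking that arithmetic:

    # Sketch: reproduce the OOM allocation sizes from the tensor shape.
    def x_mib(T: int, D: int) -> float:
        return T * (2 * D) * 2 / 2**20  # bfloat16 = 2 bytes per element

    assert x_mib(16384, 7168) == 448.0  # matches "Tried to allocate 448.00 MiB"
    assert x_mib(4096, 7168) == 112.0   # matches "Tried to allocate 112.00 MiB"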
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.3039939Z 2025-05-07T20:33:19.3040385Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.3040401Z 2025-05-07T20:33:19.3040538Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.3040763Z self=, 2025-05-07T20:33:19.3040843Z T=128, 2025-05-07T20:33:19.3040920Z D=5120, 2025-05-07T20:33:19.3041012Z scale_ub=1200.0, 2025-05-07T20:33:19.3041097Z contiguous=False, 2025-05-07T20:33:19.3041180Z compiled=False, 2025-05-07T20:33:19.3041257Z ) 2025-05-07T20:33:19.3041470Z self = 2025-05-07T20:33:19.3041637Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:19.3041641Z 2025-05-07T20:33:19.3041723Z @given( 2025-05-07T20:33:19.3041838Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.3041938Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.3042048Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.3042162Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.3042278Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.3042351Z ) 2025-05-07T20:33:19.3042680Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.3042781Z def test_silu_mul_quant( 2025-05-07T20:33:19.3042856Z self, 2025-05-07T20:33:19.3042931Z T: int, 2025-05-07T20:33:19.3043010Z D: int, 2025-05-07T20:33:19.3043105Z scale_ub: Optional[float], 2025-05-07T20:33:19.3043192Z contiguous: bool, 2025-05-07T20:33:19.3043281Z compiled: bool, 2025-05-07T20:33:19.3043357Z ) -> None: 2025-05-07T20:33:19.3043454Z torch.manual_seed(2025) 2025-05-07T20:33:19.3043526Z 2025-05-07T20:33:19.3043688Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.3043763Z 2025-05-07T20:33:19.3043854Z x_sign = torch.sign(x) 2025-05-07T20:33:19.3043976Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.3044067Z x = x_sign * x_clamp 2025-05-07T20:33:19.3044145Z x0 = x[:, :D] 2025-05-07T20:33:19.3044223Z x1 = x[:, D:] 2025-05-07T20:33:19.3044304Z 2025-05-07T20:33:19.3044388Z if contiguous: 2025-05-07T20:33:19.3044593Z x0 = x0.contiguous() 2025-05-07T20:33:19.3044684Z x1 = x1.contiguous() 2025-05-07T20:33:19.3044755Z 2025-05-07T20:33:19.3044851Z if scale_ub is not None: 2025-05-07T20:33:19.3044952Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.3045085Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.3045164Z ) 2025-05-07T20:33:19.3045238Z else: 2025-05-07T20:33:19.3045331Z scale_ub_tensor = None 2025-05-07T20:33:19.3045408Z 2025-05-07T20:33:19.3045536Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.3045620Z op = silu_mul_quant 2025-05-07T20:33:19.3045705Z if compiled: 2025-05-07T20:33:19.3045866Z op = torch.compile(op) 2025-05-07T20:33:19.3045967Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.3046041Z 2025-05-07T20:33:19.3046133Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.3046140Z 2025-05-07T20:33:19.3046238Z moe/activation_test.py:117: 2025-05-07T20:33:19.3046364Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.3046464Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.3046565Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.3047061Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.3047155Z 
_fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:19.3047517Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:19.3047738Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:19.3048086Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:19.3048181Z     kernel = self.compile(
2025-05-07T20:33:19.3048585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:19.3048764Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:19.3048889Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:19.3048894Z
2025-05-07T20:33:19.3049098Z self =
2025-05-07T20:33:19.3049871Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:19.3050365Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f359bbf11c0>}
2025-05-07T20:33:19.3051195Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:19.3051387Z context =
2025-05-07T20:33:19.3051392Z
2025-05-07T20:33:19.3051552Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:19.3051809Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:19.3051912Z                            module_map=module_map)
2025-05-07T20:33:19.3052074Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:19.3052167Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:19.3052242Z E       ^
2025-05-07T20:33:19.3052596Z E       ValueError("type fp8e4nv not supported in this architecture.
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.3052600Z 2025-05-07T20:33:19.3053089Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.3053132Z 2025-05-07T20:33:19.3053239Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.3053457Z self=, 2025-05-07T20:33:19.3053531Z T=2048, 2025-05-07T20:33:19.3053607Z D=7168, 2025-05-07T20:33:19.3053685Z scale_ub=None, 2025-05-07T20:33:19.3053774Z contiguous=False, 2025-05-07T20:33:19.3053854Z compiled=False, 2025-05-07T20:33:19.3053923Z ) 2025-05-07T20:33:19.3054142Z self = 2025-05-07T20:33:19.3054313Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:19.3054359Z 2025-05-07T20:33:19.3054433Z @given( 2025-05-07T20:33:19.3054551Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.3054652Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.3054763Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.3054885Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.3054994Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.3055071Z ) 2025-05-07T20:33:19.3055309Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.3055400Z def test_silu_mul_quant( 2025-05-07T20:33:19.3055479Z self, 2025-05-07T20:33:19.3055560Z T: int, 2025-05-07T20:33:19.3055636Z D: int, 2025-05-07T20:33:19.3055738Z scale_ub: Optional[float], 2025-05-07T20:33:19.3055825Z contiguous: bool, 2025-05-07T20:33:19.3055905Z compiled: bool, 2025-05-07T20:33:19.3055985Z ) -> None: 2025-05-07T20:33:19.3056080Z torch.manual_seed(2025) 2025-05-07T20:33:19.3056150Z 2025-05-07T20:33:19.3056317Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.3058079Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
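Note: the CompilationError above is a different failure from the OOMs. Triton's ValueError("type fp8e4nv not supported in this architecture") means the FP8 E4M3 dtype is unavailable on this GPU: the g5.4xlarge runner carries an NVIDIA A10G (compute capability 8.6), while this Triton build accepts fp8e4nv only on 8.9 or newer (Ada/Hopper), hence the ('fp8e4b15', 'fp8e5') alternatives in the message. A hedged sketch of a capability guard follows; the gating shown is illustrative, and since the rerun later in this log also fails inside the reference path (_kernel_quantize_fp8_row), a real guard would need to cover both the kernel under test and the reference quantization:

    # Sketch: skip FP8 E4M3 tests on GPUs where Triton rejects fp8e4nv.
    import unittest
    import torch

    def has_fp8e4nv() -> bool:
        # Triton's NVIDIA backend needs sm_89+ for E4M3; the A10G is sm_86.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(has_fp8e4nv(), "fp8e4nv unsupported on this GPU")
    class ActivationTests(unittest.TestCase):
        ...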
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.3058087Z 2025-05-07T20:33:19.3058201Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.3058206Z 2025-05-07T20:33:19.3058303Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.3058526Z self=, 2025-05-07T20:33:19.3058603Z T=128, 2025-05-07T20:33:19.3058676Z D=7168, 2025-05-07T20:33:19.3058759Z scale_ub=1200.0, 2025-05-07T20:33:19.3058883Z contiguous=True, 2025-05-07T20:33:19.3058970Z compiled=True, 2025-05-07T20:33:19.3059046Z ) 2025-05-07T20:33:19.3059258Z self = 2025-05-07T20:33:19.3059420Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:19.3059427Z 2025-05-07T20:33:19.3059502Z @given( 2025-05-07T20:33:19.3059616Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.3059713Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.3059824Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.3059936Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.3060050Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.3060122Z ) 2025-05-07T20:33:19.3060362Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.3060458Z def test_silu_mul_quant( 2025-05-07T20:33:19.3060538Z self, 2025-05-07T20:33:19.3060614Z T: int, 2025-05-07T20:33:19.3060777Z D: int, 2025-05-07T20:33:19.3060873Z scale_ub: Optional[float], 2025-05-07T20:33:19.3060965Z contiguous: bool, 2025-05-07T20:33:19.3061052Z compiled: bool, 2025-05-07T20:33:19.3061127Z ) -> None: 2025-05-07T20:33:19.3061220Z torch.manual_seed(2025) 2025-05-07T20:33:19.3061292Z 2025-05-07T20:33:19.3061454Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.3061530Z 2025-05-07T20:33:19.3061619Z x_sign = torch.sign(x) 2025-05-07T20:33:19.3061741Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.3061831Z x = x_sign * x_clamp 2025-05-07T20:33:19.3061909Z x0 = x[:, :D] 2025-05-07T20:33:19.3062031Z x1 = x[:, D:] 2025-05-07T20:33:19.3062105Z 2025-05-07T20:33:19.3062188Z if contiguous: 2025-05-07T20:33:19.3062280Z x0 = x0.contiguous() 2025-05-07T20:33:19.3062374Z x1 = x1.contiguous() 2025-05-07T20:33:19.3062446Z 2025-05-07T20:33:19.3062539Z if scale_ub is not None: 2025-05-07T20:33:19.3062642Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:19.3062774Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:19.3062850Z ) 2025-05-07T20:33:19.3062925Z else: 2025-05-07T20:33:19.3063016Z scale_ub_tensor = None 2025-05-07T20:33:19.3063089Z 2025-05-07T20:33:19.3063216Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:19.3063303Z op = silu_mul_quant 2025-05-07T20:33:19.3063390Z if compiled: 2025-05-07T20:33:19.3067018Z op = torch.compile(op) 2025-05-07T20:33:19.3067136Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.3067217Z 2025-05-07T20:33:19.3067308Z > y_fp8, y_scale = fn() 2025-05-07T20:33:19.3067313Z 2025-05-07T20:33:19.3067473Z moe/activation_test.py:117: 2025-05-07T20:33:19.3067606Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.3067709Z moe/activation_test.py:115: in fn 2025-05-07T20:33:19.3067808Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:19.3068174Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:19.3068266Z return fn(*args, **kwargs) 
2025-05-07T20:33:19.3068757Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:19.3068852Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:19.3069213Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:19.3069436Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:19.3069841Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:19.3069942Z kernel = self.compile( 2025-05-07T20:33:19.3070328Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:19.3070499Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:19.3070631Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:19.3070636Z 2025-05-07T20:33:19.3070840Z self = 2025-05-07T20:33:19.3071613Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:19.3072113Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f359b85fb00>} 2025-05-07T20:33:19.3072951Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:19.3073142Z context = 2025-05-07T20:33:19.3073147Z 2025-05-07T20:33:19.3073307Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:19.3073569Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:19.3073673Z module_map=module_map) 2025-05-07T20:33:19.3073833Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:19.3073928Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:19.3074043Z E ^ 2025-05-07T20:33:19.3074397Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:19.3074404Z 2025-05-07T20:33:19.3074838Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:19.3074842Z 2025-05-07T20:33:19.3074949Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.3075164Z self=, 2025-05-07T20:33:19.3075241Z T=128, 2025-05-07T20:33:19.3075324Z D=7168, 2025-05-07T20:33:19.3075408Z scale_ub=1200.0, 2025-05-07T20:33:19.3075494Z contiguous=True, 2025-05-07T20:33:19.3075579Z compiled=False, 2025-05-07T20:33:19.3075652Z ) 2025-05-07T20:33:19.3075866Z self = 2025-05-07T20:33:19.3076039Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:19.3076047Z 2025-05-07T20:33:19.3076118Z @given( 2025-05-07T20:33:19.3076236Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.3076339Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.3076457Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.3076575Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.3076686Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.3076758Z ) 2025-05-07T20:33:19.3077003Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.3077093Z def test_silu_mul_quant( 2025-05-07T20:33:19.3077167Z self, 2025-05-07T20:33:19.3077247Z T: int, 2025-05-07T20:33:19.3077321Z D: int, 2025-05-07T20:33:19.3077415Z scale_ub: Optional[float], 2025-05-07T20:33:19.3077506Z contiguous: bool, 2025-05-07T20:33:19.3077588Z compiled: bool, 2025-05-07T20:33:19.3077669Z ) -> None: 2025-05-07T20:33:19.3077760Z torch.manual_seed(2025) 2025-05-07T20:33:19.3077827Z 2025-05-07T20:33:19.3078043Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.3078120Z 2025-05-07T20:33:19.3078211Z x_sign = torch.sign(x) 2025-05-07T20:33:19.3078337Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.3080083Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.3080092Z 2025-05-07T20:33:19.3080210Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:19.3080215Z 2025-05-07T20:33:19.3080316Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.3080575Z self=, 2025-05-07T20:33:19.3080691Z T=128, 2025-05-07T20:33:19.3080767Z D=5120, 2025-05-07T20:33:19.3080853Z scale_ub=1200.0, 2025-05-07T20:33:19.3080934Z contiguous=True, 2025-05-07T20:33:19.3081016Z compiled=True, 2025-05-07T20:33:19.3081091Z ) 2025-05-07T20:33:19.3081303Z self = 2025-05-07T20:33:19.3081466Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:19.3081471Z 2025-05-07T20:33:19.3081547Z @given( 2025-05-07T20:33:19.3081663Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.3081758Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.3081917Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.3082028Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.3082143Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.3082215Z ) 2025-05-07T20:33:19.3082458Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.3082551Z def test_silu_mul_quant( 2025-05-07T20:33:19.3082622Z self, 2025-05-07T20:33:19.3082696Z T: int, 2025-05-07T20:33:19.3082773Z D: int, 2025-05-07T20:33:19.3082867Z scale_ub: Optional[float], 2025-05-07T20:33:19.3082953Z contiguous: bool, 2025-05-07T20:33:19.3083037Z compiled: bool, 2025-05-07T20:33:19.3083112Z ) -> None: 2025-05-07T20:33:19.3083205Z torch.manual_seed(2025) 2025-05-07T20:33:19.3083280Z 2025-05-07T20:33:19.3083440Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.3083511Z 2025-05-07T20:33:19.3083605Z x_sign = torch.sign(x) 2025-05-07T20:33:19.3083724Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:19.3085470Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
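Note: the free-memory figure shrinks as the session proceeds (26.44 MiB free at the T=4096 failures above, 4.44 MiB here), and by this point even a 20.00 MiB temporary for torch.clamp fails, which suggests tensors from earlier Hypothesis examples are still alive. A hedged sketch of reclaiming memory between examples (where to call it is illustrative):

    # Sketch: release CUDA memory that earlier examples left behind.
    import gc
    import torch

    def reclaim_cuda() -> None:
        gc.collect()              # drop dead Python references to tensors
        torch.cuda.empty_cache()  # hand cached, unused blocks back to the driver

Calling this at the end of test_silu_mul_quant, or from a per-example teardown, would bound how much one example's allocations can crowd out the next.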
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.3085478Z 2025-05-07T20:33:19.3085592Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:19.3085597Z 2025-05-07T20:33:19.3085699Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:19.3085916Z self=, 2025-05-07T20:33:19.3085991Z T=128, 2025-05-07T20:33:19.3086068Z D=7168, 2025-05-07T20:33:19.3086147Z scale_ub=None, 2025-05-07T20:33:19.3086228Z contiguous=True, 2025-05-07T20:33:19.3086363Z compiled=True, 2025-05-07T20:33:19.3086437Z ) 2025-05-07T20:33:19.3086654Z self = 2025-05-07T20:33:19.3086818Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:19.3086822Z 2025-05-07T20:33:19.3086897Z @given( 2025-05-07T20:33:19.3087015Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:19.3087112Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:19.3087223Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:19.3087342Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:19.3087451Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:19.3087525Z ) 2025-05-07T20:33:19.3087765Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:19.3087858Z def test_silu_mul_quant( 2025-05-07T20:33:19.3087934Z self, 2025-05-07T20:33:19.3088013Z T: int, 2025-05-07T20:33:19.3088089Z D: int, 2025-05-07T20:33:19.3088263Z scale_ub: Optional[float], 2025-05-07T20:33:19.3088352Z contiguous: bool, 2025-05-07T20:33:19.3088435Z compiled: bool, 2025-05-07T20:33:19.3088512Z ) -> None: 2025-05-07T20:33:19.3088602Z torch.manual_seed(2025) 2025-05-07T20:33:19.3088673Z 2025-05-07T20:33:19.3088836Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:19.3090572Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:19.3090619Z 2025-05-07T20:33:19.3090737Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:19.3090867Z =============================== warnings summary =============================== 2025-05-07T20:33:19.3091170Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:19.3091472Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:19.3091763Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:19.3092631Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:33:19.3092861Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:33:19.3092868Z 2025-05-07T20:33:19.3093080Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:33:19.3093242Z ================= 1 failed, 1 deselected, 3 warnings in 12.16s ================= 2025-05-07T20:33:20.9496899Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:33:21.0130949Z [EXEC] [ATTEMPT 1/2] Command attempt failed. 2025-05-07T20:33:21.0131584Z 2025-05-07T20:33:23.0149338Z [EXEC] [ATTEMPT 2/2] + conda run --no-capture-output -n build_binary python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py 2025-05-07T20:33:25.1752615Z ============================= test session starts ============================== 2025-05-07T20:33:25.1753553Z platform linux -- Python 3.13.0, pytest-8.3.5, pluggy-1.5.0 -- /home/ec2-user/miniconda/envs/build_binary/bin/python 2025-05-07T20:33:25.1754112Z cachedir: .pytest_cache 2025-05-07T20:33:25.1754683Z hypothesis profile 'ci' -> database=None, deadline=None, print_blob=True, derandomize=True, suppress_health_check=(HealthCheck.too_slow,) 2025-05-07T20:33:25.1755418Z rootdir: /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu 2025-05-07T20:33:25.1755829Z plugins: hypothesis-6.131.14 2025-05-07T20:33:26.7362138Z TMA benchmarks will be running with experimental grid constant TMA descriptor. 2025-05-07T20:33:26.8323177Z collecting ... collected 2 items / 1 deselected / 1 selected 2025-05-07T20:33:26.8323578Z run-last-failure: rerun previous 1 failure 2025-05-07T20:33:26.8323789Z 2025-05-07T20:33:28.9544690Z moe/activation_test.py::ActivationTests::test_silu_mul_quant Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.9545686Z self=, 2025-05-07T20:33:28.9546121Z T=1, 2025-05-07T20:33:28.9546603Z D=5120, 2025-05-07T20:33:28.9546873Z scale_ub=None, 2025-05-07T20:33:28.9547090Z contiguous=True, 2025-05-07T20:33:28.9547347Z compiled=True, 2025-05-07T20:33:28.9547622Z ) 2025-05-07T20:33:28.9547941Z self = 2025-05-07T20:33:28.9548423Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:28.9548690Z 2025-05-07T20:33:28.9548769Z @given( 2025-05-07T20:33:28.9549006Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.9549314Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.9549619Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.9549949Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.9550369Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.9550657Z ) 2025-05-07T20:33:28.9551006Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.9551455Z def test_silu_mul_quant( 2025-05-07T20:33:28.9551702Z self, 2025-05-07T20:33:28.9551894Z T: int, 2025-05-07T20:33:28.9552084Z D: int, 2025-05-07T20:33:28.9552299Z scale_ub: Optional[float], 2025-05-07T20:33:28.9552565Z contiguous: bool, 2025-05-07T20:33:28.9552795Z compiled: bool, 2025-05-07T20:33:28.9553027Z ) -> None: 2025-05-07T20:33:28.9553240Z torch.manual_seed(2025) 2025-05-07T20:33:28.9553478Z 2025-05-07T20:33:28.9553740Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.9554085Z 2025-05-07T20:33:28.9554277Z x_sign = torch.sign(x) 2025-05-07T20:33:28.9554560Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 
2025-05-07T20:33:28.9554870Z x = x_sign * x_clamp 2025-05-07T20:33:28.9555112Z x0 = x[:, :D] 2025-05-07T20:33:28.9555318Z x1 = x[:, D:] 2025-05-07T20:33:28.9555527Z 2025-05-07T20:33:28.9555714Z if contiguous: 2025-05-07T20:33:28.9555941Z x0 = x0.contiguous() 2025-05-07T20:33:28.9556200Z x1 = x1.contiguous() 2025-05-07T20:33:28.9556435Z 2025-05-07T20:33:28.9556617Z if scale_ub is not None: 2025-05-07T20:33:28.9556888Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.9557218Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.9557516Z ) 2025-05-07T20:33:28.9557708Z else: 2025-05-07T20:33:28.9557916Z scale_ub_tensor = None 2025-05-07T20:33:28.9558168Z 2025-05-07T20:33:28.9558390Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.9558698Z op = silu_mul_quant 2025-05-07T20:33:28.9558945Z if compiled: 2025-05-07T20:33:28.9559190Z op = torch.compile(op) 2025-05-07T20:33:28.9559486Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.9559757Z 2025-05-07T20:33:28.9560034Z y_fp8, y_scale = fn() 2025-05-07T20:33:28.9560325Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:28.9560611Z 2025-05-07T20:33:28.9560840Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.9561174Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:28.9561460Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:28.9561765Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:28.9562126Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:28.9562431Z 2025-05-07T20:33:28.9562635Z > y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:28.9562827Z 2025-05-07T20:33:28.9562927Z moe/activation_test.py:126: 2025-05-07T20:33:28.9563225Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.9563557Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:28.9563880Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:28.9564711Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:28.9565491Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:28.9566035Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.9566703Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.9567389Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:28.9568102Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:28.9568879Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:28.9569586Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:28.9570199Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:28.9570706Z fn() 2025-05-07T20:33:28.9571230Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:28.9571832Z self.fn.run( 2025-05-07T20:33:28.9572315Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.9572846Z kernel = self.compile( 2025-05-07T20:33:28.9573388Z 
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.9574055Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.9574439Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.9574677Z 2025-05-07T20:33:28.9574881Z self = 2025-05-07T20:33:28.9575962Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.9577337Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89f5d3a700>} 2025-05-07T20:33:28.9578673Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.9579672Z context = 2025-05-07T20:33:28.9579969Z 2025-05-07T20:33:28.9580138Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.9580703Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.9581166Z module_map=module_map) 2025-05-07T20:33:28.9581519Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.9581878Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:28.9582144Z E ^ 2025-05-07T20:33:28.9582592Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.9583050Z 2025-05-07T20:33:28.9583473Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:28.9583978Z 2025-05-07T20:33:28.9584079Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:28.9584490Z self=, 2025-05-07T20:33:28.9584899Z T=2048, 2025-05-07T20:33:28.9585084Z D=5120, 2025-05-07T20:33:28.9585274Z scale_ub=1200.0, 2025-05-07T20:33:28.9585495Z contiguous=True, 2025-05-07T20:33:28.9585719Z compiled=False, 2025-05-07T20:33:28.9586008Z ) 2025-05-07T20:33:28.9586323Z self = 2025-05-07T20:33:28.9586811Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:28.9587075Z 2025-05-07T20:33:28.9587159Z @given( 2025-05-07T20:33:28.9587392Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:28.9587751Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:28.9588051Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:28.9588374Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:28.9588688Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:28.9588974Z ) 2025-05-07T20:33:28.9589363Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:28.9589806Z def test_silu_mul_quant( 2025-05-07T20:33:28.9590046Z self, 2025-05-07T20:33:28.9590241Z T: int, 2025-05-07T20:33:28.9590431Z D: int, 2025-05-07T20:33:28.9590652Z scale_ub: Optional[float], 2025-05-07T20:33:28.9590919Z contiguous: bool, 2025-05-07T20:33:28.9591152Z compiled: bool, 2025-05-07T20:33:28.9591370Z ) -> None: 2025-05-07T20:33:28.9591588Z torch.manual_seed(2025) 2025-05-07T20:33:28.9591831Z 2025-05-07T20:33:28.9592096Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:28.9592443Z 2025-05-07T20:33:28.9592638Z x_sign = torch.sign(x) 2025-05-07T20:33:28.9592923Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:28.9593236Z x = x_sign * x_clamp 2025-05-07T20:33:28.9593478Z x0 = x[:, :D] 
2025-05-07T20:33:28.9593688Z x1 = x[:, D:] 2025-05-07T20:33:28.9593904Z 2025-05-07T20:33:28.9594087Z if contiguous: 2025-05-07T20:33:28.9594309Z x0 = x0.contiguous() 2025-05-07T20:33:28.9594568Z x1 = x1.contiguous() 2025-05-07T20:33:28.9594810Z 2025-05-07T20:33:28.9594996Z if scale_ub is not None: 2025-05-07T20:33:28.9595270Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:28.9595605Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:28.9595911Z ) 2025-05-07T20:33:28.9596107Z else: 2025-05-07T20:33:28.9596311Z scale_ub_tensor = None 2025-05-07T20:33:28.9596554Z 2025-05-07T20:33:28.9596781Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:28.9597085Z op = silu_mul_quant 2025-05-07T20:33:28.9597340Z if compiled: 2025-05-07T20:33:28.9597578Z op = torch.compile(op) 2025-05-07T20:33:28.9597870Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.9598148Z 2025-05-07T20:33:28.9598331Z > y_fp8, y_scale = fn() 2025-05-07T20:33:28.9598497Z 2025-05-07T20:33:28.9598596Z moe/activation_test.py:117: 2025-05-07T20:33:28.9598940Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.9599265Z moe/activation_test.py:115: in fn 2025-05-07T20:33:28.9599547Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:28.9600254Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:28.9600931Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:28.9601459Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:28.9602132Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:28.9602786Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:28.9603304Z kernel = self.compile( 2025-05-07T20:33:28.9603853Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:28.9604539Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:28.9604967Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:28.9605188Z 2025-05-07T20:33:28.9605389Z self = 2025-05-07T20:33:28.9606447Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:28.9607798Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89f5bf2020>} 2025-05-07T20:33:28.9609202Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:28.9610214Z context = 2025-05-07T20:33:28.9610496Z 2025-05-07T20:33:28.9610660Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:28.9611174Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:28.9611634Z module_map=module_map) 2025-05-07T20:33:28.9611992Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:28.9612343Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:28.9612605Z E ^ 2025-05-07T20:33:28.9613068Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:28.9613516Z 2025-05-07T20:33:28.9613942Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:29.6170608Z 2025-05-07T20:33:29.6171020Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:29.6171759Z self=, 2025-05-07T20:33:29.6172405Z T=2048, 2025-05-07T20:33:29.6172696Z D=5120, 2025-05-07T20:33:29.6172987Z scale_ub=1200.0, 2025-05-07T20:33:29.6173317Z contiguous=True, 2025-05-07T20:33:29.6173659Z compiled=True, 2025-05-07T20:33:29.6173975Z ) 2025-05-07T20:33:29.6174462Z self = 2025-05-07T20:33:29.6175301Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:29.6175710Z 2025-05-07T20:33:29.6175820Z @given( 2025-05-07T20:33:29.6176186Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:29.6176702Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:29.6177192Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:29.6178042Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:29.6178569Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:29.6179078Z ) 2025-05-07T20:33:29.6179627Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:29.6180367Z def test_silu_mul_quant( 2025-05-07T20:33:29.6180753Z self, 2025-05-07T20:33:29.6181050Z T: int, 2025-05-07T20:33:29.6181363Z D: int, 2025-05-07T20:33:29.6181694Z scale_ub: Optional[float], 2025-05-07T20:33:29.6182114Z contiguous: bool, 2025-05-07T20:33:29.6182499Z compiled: bool, 2025-05-07T20:33:29.6182862Z ) -> None: 2025-05-07T20:33:29.6183192Z torch.manual_seed(2025) 2025-05-07T20:33:29.6183570Z 2025-05-07T20:33:29.6183997Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:29.6184549Z 2025-05-07T20:33:29.6184842Z x_sign = torch.sign(x) 2025-05-07T20:33:29.6185299Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:29.6185798Z x = x_sign * x_clamp 2025-05-07T20:33:29.6186392Z x0 = x[:, :D] 2025-05-07T20:33:29.6186723Z x1 = x[:, D:] 2025-05-07T20:33:29.6187050Z 2025-05-07T20:33:29.6187327Z if contiguous: 2025-05-07T20:33:29.6187792Z x0 = x0.contiguous() 2025-05-07T20:33:29.6188197Z x1 = x1.contiguous() 2025-05-07T20:33:29.6188581Z 2025-05-07T20:33:29.6188886Z if scale_ub is not None: 2025-05-07T20:33:29.6189324Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:29.6189848Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:29.6190347Z ) 2025-05-07T20:33:29.6190652Z else: 2025-05-07T20:33:29.6190968Z scale_ub_tensor = None 2025-05-07T20:33:29.6191325Z 2025-05-07T20:33:29.6191871Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:29.6192334Z op = silu_mul_quant 2025-05-07T20:33:29.6192711Z if compiled: 2025-05-07T20:33:29.6193093Z op = torch.compile(op) 2025-05-07T20:33:29.6193543Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:29.6193966Z 2025-05-07T20:33:29.6194273Z y_fp8, y_scale = fn() 2025-05-07T20:33:29.6194712Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:29.6195188Z 2025-05-07T20:33:29.6195578Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:29.6196145Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:29.6196633Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:29.6197127Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:29.6197681Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:29.6198183Z 2025-05-07T20:33:29.6198492Z > 
y_fp8_ref, y_scale_ref = ref_fn() 2025-05-07T20:33:29.6198801Z 2025-05-07T20:33:29.6198961Z moe/activation_test.py:126: 2025-05-07T20:33:29.6199436Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.6200005Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:29.6200527Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:29.6201925Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:29.6203243Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:29.6204180Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:29.6205281Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:29.6206430Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:29.6207615Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:29.6208927Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:29.6209993Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:29.6210998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:29.6211857Z fn() 2025-05-07T20:33:29.6212691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:29.6222720Z self.fn.run( 2025-05-07T20:33:29.6223597Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:29.6224533Z kernel = self.compile( 2025-05-07T20:33:29.6225469Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:29.6226631Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:29.6227313Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.6227891Z 2025-05-07T20:33:29.6228297Z self = 2025-05-07T20:33:29.6230049Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:29.6232487Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89f4acf560>} 2025-05-07T20:33:29.6234930Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:29.6236840Z context = 2025-05-07T20:33:29.6237354Z 2025-05-07T20:33:29.6237624Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:29.6238497Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:29.6239303Z module_map=module_map) 2025-05-07T20:33:29.6239898Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:29.6240702Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:29.6241145Z E ^ 2025-05-07T20:33:29.6241940Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:29.6242746Z 2025-05-07T20:33:29.6243482Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:29.6244414Z 2025-05-07T20:33:29.6244573Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:29.6245284Z self=, 2025-05-07T20:33:29.6245978Z T=16384, 2025-05-07T20:33:29.6246279Z D=7168, 2025-05-07T20:33:29.6246590Z scale_ub=1200.0, 2025-05-07T20:33:29.6246945Z contiguous=False, 2025-05-07T20:33:29.6247305Z compiled=False, 2025-05-07T20:33:29.6247633Z ) 2025-05-07T20:33:29.6248159Z self = 2025-05-07T20:33:29.6249010Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:29.6249501Z 2025-05-07T20:33:29.6249631Z @given( 2025-05-07T20:33:29.6249996Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:29.6250512Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:29.6251023Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:29.6251577Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:29.6252134Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:29.6252604Z ) 2025-05-07T20:33:29.6253338Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:29.6254113Z def test_silu_mul_quant( 2025-05-07T20:33:29.6254498Z self, 2025-05-07T20:33:29.6254813Z T: int, 2025-05-07T20:33:29.6255127Z D: int, 2025-05-07T20:33:29.6255468Z scale_ub: Optional[float], 2025-05-07T20:33:29.6255914Z contiguous: bool, 2025-05-07T20:33:29.6256307Z compiled: bool, 2025-05-07T20:33:29.6256658Z ) -> None: 2025-05-07T20:33:29.6257006Z torch.manual_seed(2025) 2025-05-07T20:33:29.6257406Z 2025-05-07T20:33:29.6257838Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:29.6258382Z 2025-05-07T20:33:29.6258682Z x_sign = torch.sign(x) 2025-05-07T20:33:29.6259054Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:29.6259461Z x = x_sign * x_clamp 2025-05-07T20:33:29.6259772Z x0 = x[:, :D] 2025-05-07T20:33:29.6260081Z x1 = x[:, D:] 2025-05-07T20:33:29.6260364Z 2025-05-07T20:33:29.6260863Z if contiguous: 2025-05-07T20:33:29.6261201Z x0 = x0.contiguous() 2025-05-07T20:33:29.6261562Z x1 = x1.contiguous() 2025-05-07T20:33:29.6261910Z 2025-05-07T20:33:29.6262178Z if scale_ub is not None: 2025-05-07T20:33:29.6262564Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:29.6263085Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:29.6263562Z ) 2025-05-07T20:33:29.6263835Z else: 2025-05-07T20:33:29.6264141Z scale_ub_tensor = None 2025-05-07T20:33:29.6264532Z 2025-05-07T20:33:29.6264859Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:29.6265331Z op = silu_mul_quant 2025-05-07T20:33:29.6265889Z if compiled: 2025-05-07T20:33:29.6266286Z op = torch.compile(op) 2025-05-07T20:33:29.6266774Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:29.6267235Z 2025-05-07T20:33:29.6267670Z > y_fp8, y_scale = fn() 2025-05-07T20:33:29.6267963Z 2025-05-07T20:33:29.6268124Z moe/activation_test.py:117: 2025-05-07T20:33:29.6268611Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.6269161Z moe/activation_test.py:115: in fn 2025-05-07T20:33:29.6269614Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:29.6270827Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:29.6272054Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:29.6272977Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:29.6274180Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:29.6275359Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:29.6276288Z kernel = self.compile( 2025-05-07T20:33:29.6277226Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:29.6278300Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:29.6278894Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:29.6279233Z 2025-05-07T20:33:29.6279548Z self = 2025-05-07T20:33:29.6281392Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:29.6283910Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89f4d787c0>} 2025-05-07T20:33:29.6286433Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:29.6288278Z context = 2025-05-07T20:33:29.6288828Z 2025-05-07T20:33:29.6289107Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:29.6290007Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:29.6290814Z module_map=module_map) 2025-05-07T20:33:29.6291416Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:29.6291994Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:29.6292425Z E ^ 2025-05-07T20:33:29.6293221Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:29.6294038Z 2025-05-07T20:33:29.6294915Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:30.3244265Z 2025-05-07T20:33:30.3244749Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:30.3245194Z self=, 2025-05-07T20:33:30.3245606Z T=1, 2025-05-07T20:33:30.3245789Z D=7168, 2025-05-07T20:33:30.3245979Z scale_ub=None, 2025-05-07T20:33:30.3246185Z contiguous=True, 2025-05-07T20:33:30.3246407Z compiled=True, 2025-05-07T20:33:30.3246619Z ) 2025-05-07T20:33:30.3246936Z self = 2025-05-07T20:33:30.3247417Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:30.3247996Z 2025-05-07T20:33:30.3248073Z @given( 2025-05-07T20:33:30.3248308Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:30.3248625Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:30.3248944Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:30.3249277Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:30.3249594Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:30.3249879Z ) 2025-05-07T20:33:30.3250231Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:30.3250673Z def test_silu_mul_quant( 2025-05-07T20:33:30.3250910Z self, 2025-05-07T20:33:30.3251106Z T: int, 2025-05-07T20:33:30.3251291Z D: int, 2025-05-07T20:33:30.3251508Z scale_ub: Optional[float], 2025-05-07T20:33:30.3251775Z contiguous: bool, 2025-05-07T20:33:30.3252009Z compiled: bool, 2025-05-07T20:33:30.3252229Z ) -> None: 2025-05-07T20:33:30.3252439Z torch.manual_seed(2025) 2025-05-07T20:33:30.3252675Z 2025-05-07T20:33:30.3252937Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:30.3253272Z 2025-05-07T20:33:30.3253463Z x_sign = torch.sign(x) 2025-05-07T20:33:30.3253741Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:30.3254043Z x = x_sign * x_clamp 2025-05-07T20:33:30.3254282Z x0 = x[:, :D] 2025-05-07T20:33:30.3254485Z x1 = x[:, D:] 2025-05-07T20:33:30.3254687Z 2025-05-07T20:33:30.3254863Z if contiguous: 2025-05-07T20:33:30.3255085Z x0 = x0.contiguous() 2025-05-07T20:33:30.3255336Z x1 = x1.contiguous() 2025-05-07T20:33:30.3255569Z 2025-05-07T20:33:30.3255745Z if scale_ub is not None: 2025-05-07T20:33:30.3256007Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:30.3256331Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:30.3256644Z ) 2025-05-07T20:33:30.3256830Z else: 2025-05-07T20:33:30.3257036Z scale_ub_tensor = None 2025-05-07T20:33:30.3257280Z 2025-05-07T20:33:30.3257605Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:30.3257922Z op = silu_mul_quant 2025-05-07T20:33:30.3258165Z if compiled: 2025-05-07T20:33:30.3258436Z op = torch.compile(op) 2025-05-07T20:33:30.3258729Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:30.3258991Z 2025-05-07T20:33:30.3259175Z y_fp8, y_scale = fn() 2025-05-07T20:33:30.3259456Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:30.3259731Z 2025-05-07T20:33:30.3259959Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:30.3260283Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:30.3260563Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:30.3260866Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:30.3261226Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:30.3261523Z 2025-05-07T20:33:30.3261726Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:30.3262014Z 2025-05-07T20:33:30.3262184Z moe/activation_test.py:126: 2025-05-07T20:33:30.3262476Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:30.3262797Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:30.3263116Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:30.3263916Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:30.3264646Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:30.3265195Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:30.3265884Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:30.3266613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:30.3267319Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:30.3268127Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:30.3268774Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:30.3269365Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:30.3269862Z fn() 2025-05-07T20:33:30.3270371Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:30.3270958Z self.fn.run( 2025-05-07T20:33:30.3271412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:30.3271941Z kernel = self.compile( 2025-05-07T20:33:30.3272490Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:30.3273135Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:30.3273525Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:30.3273747Z 2025-05-07T20:33:30.3273952Z self = 2025-05-07T20:33:30.3275033Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:30.3276407Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89f4baa340>} 2025-05-07T20:33:30.3277873Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:30.3278935Z context = 2025-05-07T20:33:30.3279215Z 2025-05-07T20:33:30.3279377Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:30.3279889Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:30.3280357Z module_map=module_map) 2025-05-07T20:33:30.3280712Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:30.3281071Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:30.3281338Z E ^ 2025-05-07T20:33:30.3281806Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=False, compiled=False); same test source as above, but with compiled=False the failure is at the forward call itself:

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
E   triton.compiler.errors.CompilationError: at 1:0:
E   def _fbgemm_silu_mul_quant(
E   ^
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
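Both failure paths share one root cause: Triton's NVIDIA backend refuses to lower the fp8e4nv element type (torch.float8_e4m3fn) during ast_to_ttir. As of recent Triton releases, fp8e4nv kernels require compute capability 8.9 or newer (Ada/Hopper); the error text offering only 'fp8e4b15' and 'fp8e5' suggests this runner's GPU predates sm_89. A minimal probe for this, as a sketch (the helper name supports_triton_fp8e4nv is illustrative and not part of the test file):

    import torch

    def supports_triton_fp8e4nv() -> bool:
        # Triton only lowers fp8e4nv (float8_e4m3fn) on NVIDIA GPUs with
        # compute capability >= 8.9; older parts expose only fp8e4b15/fp8e5.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)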
All remaining drawn examples fail with the identical CompilationError; only the parameters differ. Examples with compiled=False fail at the forward call fn() (moe/activation_test.py:117 -> fbgemm_gpu/experimental/gen_ai/moe/activation.py:80) while compiling _fbgemm_silu_mul_quant, and examples with compiled=True fail at the reference call ref_fn() (moe/activation_test.py:126 -> fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370) while compiling _kernel_quantize_fp8_row:

Trying example: test_silu_mul_quant(T=4096, D=7168, scale_ub=None, contiguous=False, compiled=False) -> CompilationError while compiling _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=True) -> CompilationError while compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=128, D=7168, scale_ub=None, contiguous=False, compiled=False) -> CompilationError while compiling _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=False) -> CompilationError while compiling _fbgemm_silu_mul_quant
Trying example: test_silu_mul_quant(T=1, D=5120, scale_ub=None, contiguous=True, compiled=True) -> CompilationError while compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=2048, D=5120, scale_ub=None, contiguous=True, compiled=True) -> CompilationError while compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=128, D=5120, scale_ub=None, contiguous=True, compiled=True) -> CompilationError while compiling _kernel_quantize_fp8_row
Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True) -> CompilationError while compiling _kernel_quantize_fp8_row

In every case the root error reported at triton/compiler/compiler.py:100 is:
E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
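Since every drawn example dies in kernel compilation rather than in a numerical check, the suite would report this more usefully as a skip on unsupported hardware. A sketch of such a guard (the decorator name skip_unless_fp8e4nv is hypothetical, not the FBGEMM authors' fix), reusing the capability probe above:

    import pytest
    import torch

    skip_unless_fp8e4nv = pytest.mark.skipif(
        not torch.cuda.is_available()
        or torch.cuda.get_device_capability() < (8, 9),
        reason="Triton fp8e4nv requires an sm_89+ GPU; this device only supports fp8e4b15/fp8e5",
    )

    # Applied to the Hypothesis test, e.g.:
    #
    # @skip_unless_fp8e4nv
    # @given(...)
    # @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    # def test_silu_mul_quant(...): ...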
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:32.7664481Z 
2025-05-07T20:33:32.7664908Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:32.7665412Z 
2025-05-07T20:33:32.7665518Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
2025-05-07T20:33:32.7682478Z [test body identical to the T=4096 example above; fails in ref_fn() at moe/activation_test.py:126 -> triton_quantize_fp8_row (fp8_gemm.py:2370) -> _kernel_quantize_fp8_row]
2025-05-07T20:33:32.7701342Z E triton.compiler.errors.CompilationError: at 1:0: def _kernel_quantize_fp8_row( ^
2025-05-07T20:33:32.7708921Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:32.7709863Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
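Every example in this run dies on the same root cause: Triton's fp8e4nv dtype (the e4m3 float8 encoding) is only lowered on NVIDIA GPUs with compute capability 8.9 or newer (Ada/Hopper), while the A10G on this g5 runner reports capability 8.6, where Triton exposes only fp8e4b15 and fp8e5, exactly as the ValueError lists. A minimal capability guard the test could check before exercising the fp8 kernels is sketched below; supports_fp8e4nv is a hypothetical helper, not part of the test file:

    import torch

    def supports_fp8e4nv() -> bool:
        # Triton lowers fp8e4nv (float8 e4m3) only on SM 8.9+ (Ada/Hopper).
        # Ampere parts such as the A10G report (8, 6) and expose only the
        # fp8e4b15 / fp8e5 encodings, matching the ValueError in this log.
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)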
2025-05-07T20:33:32.7898197Z W0507 20:33:32.788000 89776 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] torch._dynamo hit config.recompile_limit (8)
2025-05-07T20:33:32.7899999Z W0507 20:33:32.788000 89776 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] function: 'silu_mul_quant' (/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:55)
2025-05-07T20:33:32.7901565Z W0507 20:33:32.788000 89776 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] last reason: 0/7: tensor 'x0' stride mismatch at index 0. expected 5120, actual 10240
2025-05-07T20:33:32.7902540Z W0507 20:33:32.788000 89776 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
2025-05-07T20:33:32.7903690Z W0507 20:33:32.788000 89776 site-packages/torch/_dynamo/convert_frame.py:987] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
2025-05-07T20:33:33.1918697Z 
2025-05-07T20:33:33.1919009Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:33.1926665Z [test body identical to the T=4096 example above]
2025-05-07T20:33:33.1934034Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:33.1934306Z moe/activation_test.py:117: 
2025-05-07T20:33:33.1934601Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:33.1935007Z moe/activation_test.py:115: in fn
2025-05-07T20:33:33.1935285Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:33.1935857Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:33:33.1936413Z     return fn(*args, **kwargs)
2025-05-07T20:33:33.1937054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:33.1937723Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:33.1938262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:33.1938922Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:33.1939582Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:33.1940396Z     kernel = self.compile(
2025-05-07T20:33:33.1940949Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:33.1941590Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:33.1949192Z E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^
2025-05-07T20:33:33.1950244Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:33.1951132Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
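The recompile_limit warning above is a side effect of the Hypothesis sweep rather than an independent bug: contiguous examples hand silu_mul_quant an x0 with row stride 5120, non-contiguous ones a view with row stride 10240, and each flip installs a new Dynamo guard set until the limit of 8 is hit, after which new variants run eagerly. A hedged workaround for a property test, assuming the goal is to keep every example compiled, is to clear the compile caches per example:

    import torch

    # Assumed placement: first statement of the Hypothesis test body, so each
    # example starts from an empty Dynamo cache instead of accumulating
    # stride-specific guard sets until config.recompile_limit (8) is reached.
    torch._dynamo.reset()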
2025-05-07T20:33:33.1951734Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:33.1968378Z [test body identical to the T=4096 example above; fails in ref_fn() at moe/activation_test.py:126 -> triton_quantize_fp8_row (fp8_gemm.py:2370) -> _kernel_quantize_fp8_row]
2025-05-07T20:33:33.1987161Z E triton.compiler.errors.CompilationError: at 1:0: def _kernel_quantize_fp8_row( ^
2025-05-07T20:33:33.1988307Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:33.1989193Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:33.3387942Z Trying example: test_silu_mul_quant(self=, T=1, D=5120, scale_ub=None, contiguous=True, compiled=False)
2025-05-07T20:33:33.3403007Z [test body identical to the T=4096 example above; fails eagerly in fn() at moe/activation_test.py:117 -> silu_mul_quant (activation.py:80) -> _fbgemm_silu_mul_quant]
2025-05-07T20:33:33.3416769Z E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^
2025-05-07T20:33:33.3417833Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:33.3418716Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
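The compiled=False examples make clear that torch.compile is not the culprit: _fbgemm_silu_mul_quant[grid](...) is launched through Triton's own JIT, so the eager path compiles (and fails) the same way. A minimal eager repro, assuming the import path shown in the traceback and any pre-Ada (SM < 8.9) GPU:

    import torch
    from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

    x0 = torch.randn(1, 5120, device="cuda", dtype=torch.bfloat16)
    x1 = torch.randn(1, 5120, device="cuda", dtype=torch.bfloat16)
    # On SM < 8.9 this raises triton.compiler.errors.CompilationError wrapping
    # ValueError("type fp8e4nv not supported in this architecture. ...").
    y_fp8, y_scale = silu_mul_quant(x0, x1, None)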
2025-05-07T20:33:33.3419322Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:33.3433632Z [test body identical to the T=4096 example above; fails in fn() at moe/activation_test.py:117 -> torch._dynamo eval_frame -> silu_mul_quant (activation.py:80) -> _fbgemm_silu_mul_quant]
2025-05-07T20:33:33.3448782Z E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^
2025-05-07T20:33:33.3449852Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:33.3450722Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
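For reference, the quantity ref_fn computes is plain SiLU-gated multiplication, y = x0 * sigmoid(x0) * x1, followed by row-wise fp8 quantization. The sketch below is a pure-PyTorch stand-in for triton_quantize_fp8_row under assumed semantics (per-row scale = row max / fp8 max, optionally capped by scale_ub, with dequantization as y_fp8.to(torch.float32) * y_scale[:, None], as in the test); it is not the fbgemm_gpu implementation:

    from typing import Optional, Tuple
    import torch

    def rowwise_fp8_quant_ref(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        fp8_max = torch.finfo(torch.float8_e4m3fn).max
        row_max = y.abs().amax(dim=1).float()
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        scale = torch.clamp(row_max, min=1e-12) / fp8_max  # guard divide-by-zero
        y_fp8 = (y.float() / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale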
2025-05-07T20:33:33.3451330Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:33.5033925Z [test body identical to the T=4096 example above; fails eagerly in fn() at moe/activation_test.py:117 -> silu_mul_quant (activation.py:80) -> _fbgemm_silu_mul_quant]
2025-05-07T20:33:33.5048109Z E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^
2025-05-07T20:33:33.5049184Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:33.5050095Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
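With a guard like supports_fp8e4nv() from the sketch above, this whole family of failures collapses into a single clean skip on pre-Ada GPUs. A hedged example of wiring it into a unittest-style test such as this one (the skip composes with Hypothesis's @given because it fires before any example is drawn):

    import unittest
    import torch

    def supports_fp8e4nv() -> bool:  # hypothetical helper, as sketched earlier
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    class ActivationTests(unittest.TestCase):
        @unittest.skipIf(
            not supports_fp8e4nv(),
            "Triton fp8e4nv needs SM 8.9+; this GPU exposes only fp8e4b15/fp8e5",
        )
        def test_silu_mul_quant(self) -> None:
            ...  # body as in the log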
2025-05-07T20:33:33.5050781Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=None, contiguous=False, compiled=False)
2025-05-07T20:33:33.5065037Z [test body identical to the T=4096 example above; fails eagerly in fn() at moe/activation_test.py:117 -> silu_mul_quant (activation.py:80) -> _fbgemm_silu_mul_quant]
2025-05-07T20:33:33.5078784Z E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^
2025-05-07T20:33:33.5079854Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:33.5080724Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:33.5081324Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:33:33.5095529Z [test body identical to the T=4096 example above; fails eagerly in fn() at moe/activation_test.py:117 -> silu_mul_quant (activation.py:80) -> _fbgemm_silu_mul_quant]
2025-05-07T20:33:33.5109415Z E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^
2025-05-07T20:33:33.5110518Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:33.5111405Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:33.6645923Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:33.6660630Z [test body identical to the T=4096 example above; fails in fn() at moe/activation_test.py:117 -> torch._dynamo eval_frame -> silu_mul_quant (activation.py:80) -> _fbgemm_silu_mul_quant]
2025-05-07T20:33:33.6675608Z E triton.compiler.errors.CompilationError: at 1:0: def _fbgemm_silu_mul_quant( ^
2025-05-07T20:33:33.6676676Z E ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:33.6677557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.6677135Z 2025-05-07T20:33:33.6677557Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.6678068Z 2025-05-07T20:33:33.6678167Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.6678572Z self=, 2025-05-07T20:33:33.6678955Z T=1, 2025-05-07T20:33:33.6679139Z D=7168, 2025-05-07T20:33:33.6679329Z scale_ub=1200.0, 2025-05-07T20:33:33.6679542Z contiguous=False, 2025-05-07T20:33:33.6679759Z compiled=True, 2025-05-07T20:33:33.6679962Z ) 2025-05-07T20:33:33.6680270Z self = 2025-05-07T20:33:33.6680746Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:33.6681006Z 2025-05-07T20:33:33.6681092Z @given( 2025-05-07T20:33:33.6681369Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.6681670Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.6681975Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.6682294Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.6682612Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.6682892Z ) 2025-05-07T20:33:33.6683245Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.6683683Z def test_silu_mul_quant( 2025-05-07T20:33:33.6683927Z self, 2025-05-07T20:33:33.6684122Z T: int, 2025-05-07T20:33:33.6684316Z D: int, 2025-05-07T20:33:33.6684533Z scale_ub: Optional[float], 2025-05-07T20:33:33.6684805Z contiguous: bool, 2025-05-07T20:33:33.6685042Z compiled: bool, 2025-05-07T20:33:33.6685262Z ) -> None: 2025-05-07T20:33:33.6685473Z torch.manual_seed(2025) 2025-05-07T20:33:33.6685706Z 2025-05-07T20:33:33.6685985Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.6686375Z 2025-05-07T20:33:33.6686607Z x_sign = torch.sign(x) 2025-05-07T20:33:33.6686893Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.6687201Z x = x_sign * x_clamp 2025-05-07T20:33:33.6687440Z x0 = x[:, :D] 2025-05-07T20:33:33.6687647Z x1 = x[:, D:] 2025-05-07T20:33:33.6687856Z 2025-05-07T20:33:33.6688039Z if contiguous: 2025-05-07T20:33:33.6688255Z x0 = x0.contiguous() 2025-05-07T20:33:33.6688502Z x1 = x1.contiguous() 2025-05-07T20:33:33.6688755Z 2025-05-07T20:33:33.6688941Z if scale_ub is not None: 2025-05-07T20:33:33.6689197Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.6689524Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.6689895Z ) 2025-05-07T20:33:33.6690079Z else: 2025-05-07T20:33:33.6690283Z scale_ub_tensor = None 2025-05-07T20:33:33.6690536Z 2025-05-07T20:33:33.6690759Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.6691069Z op = silu_mul_quant 2025-05-07T20:33:33.6691317Z if compiled: 2025-05-07T20:33:33.6691554Z op = torch.compile(op) 2025-05-07T20:33:33.6691847Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.6692115Z 2025-05-07T20:33:33.6692297Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.6692463Z 2025-05-07T20:33:33.6692556Z moe/activation_test.py:117: 2025-05-07T20:33:33.6692845Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.6693171Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.6693442Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.6703932Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:33.6704606Z return fn(*args, **kwargs) 
2025-05-07T20:33:33.6705295Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.6705999Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.6706559Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.6707249Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.6707990Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.6708532Z kernel = self.compile( 2025-05-07T20:33:33.6709090Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.6709740Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.6710147Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.6710387Z 2025-05-07T20:33:33.6710679Z self = 2025-05-07T20:33:33.6711780Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.6713143Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca482c00>} 2025-05-07T20:33:33.6714470Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.6715496Z context = 2025-05-07T20:33:33.6715784Z 2025-05-07T20:33:33.6715962Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.6716530Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.6717035Z module_map=module_map) 2025-05-07T20:33:33.6717405Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.6717761Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.6718027Z E ^ 2025-05-07T20:33:33.6718495Z E ValueError("type fp8e4nv not supported in this architecture. 
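Every one of these failures bottoms out in the same check: Triton refuses to lower a kernel that touches the `fp8e4nv` (float8 e4m3) type while building TTIR on this GPU, which is why the error points at the kernel signature (`at 1:0`) rather than any line inside it. A minimal sketch that should reproduce the same `CompilationError` outside FBGEMM; the kernel name is ours, and it assumes a Triton build that accepts `torch.float8_e4m3fn` tensors as kernel arguments and a GPU below compute capability 8.9:

```python
# Hypothetical repro, not FBGEMM code: on a pre-SM-8.9 GPU (e.g. the A10G
# behind linux.g5.4xlarge) Triton rejects fp8e4nv while lowering the AST
# to TTIR, before the kernel ever launches.
import torch
import triton
import triton.language as tl


@triton.jit
def fp8_cast_kernel(x_ptr, y_ptr):
    x = tl.load(x_ptr)
    # The fp8 output pointer and this cast trip the architecture check
    # that raises ValueError("type fp8e4nv not supported ...").
    tl.store(y_ptr, x.to(tl.float8e4nv))


x = torch.ones(1, device="cuda", dtype=torch.float32)
y = torch.empty(1, device="cuda", dtype=torch.float8_e4m3fn)
fp8_cast_kernel[(1,)](x, y)  # raises triton.compiler.errors.CompilationError
```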
2025-05-07T20:33:33.8754680Z 
2025-05-07T20:33:33.8755213Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:33.8755800Z     self=<...>,
2025-05-07T20:33:33.8756221Z     T=1,
2025-05-07T20:33:33.8756466Z     D=7168,
2025-05-07T20:33:33.8756671Z     scale_ub=None,
2025-05-07T20:33:33.8756904Z     contiguous=False,
2025-05-07T20:33:33.8757131Z     compiled=True,
2025-05-07T20:33:33.8757339Z )
2025-05-07T20:33:33.8757661Z self = <...>
2025-05-07T20:33:33.8758145Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True
2025-05-07T20:33:33.8758398Z 
[... @given/@settings stanza and test body identical to the example above; elided ...]
2025-05-07T20:33:33.8769684Z         y_fp8, y_scale = fn()
2025-05-07T20:33:33.8769960Z         y = y_fp8.to(torch.float32) * y_scale[:, None]
2025-05-07T20:33:33.8770310Z 
2025-05-07T20:33:33.8770595Z         def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:33.8770916Z             x0_fp32 = x0.to(torch.float32)
2025-05-07T20:33:33.8771196Z             x1_fp32 = x1.to(torch.float32)
2025-05-07T20:33:33.8771500Z             y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
2025-05-07T20:33:33.8771847Z             return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:33.8772142Z 
2025-05-07T20:33:33.8772337Z >       y_fp8_ref, y_scale_ref = ref_fn()
2025-05-07T20:33:33.8772526Z 
2025-05-07T20:33:33.8772627Z moe/activation_test.py:126: 
2025-05-07T20:33:33.8772908Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:33.8773230Z moe/activation_test.py:124: in ref_fn
2025-05-07T20:33:33.8773594Z     return triton_quantize_fp8_row(y, scale_ub_tensor)
2025-05-07T20:33:33.8774392Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
2025-05-07T20:33:33.8775124Z     _kernel_quantize_fp8_row[grid](
2025-05-07T20:33:33.8775668Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:33.8776333Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:33.8777008Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
2025-05-07T20:33:33.8777710Z     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
2025-05-07T20:33:33.8778428Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
2025-05-07T20:33:33.8779054Z     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
2025-05-07T20:33:33.8779641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
2025-05-07T20:33:33.8780146Z     fn()
2025-05-07T20:33:33.8780653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
2025-05-07T20:33:33.8781232Z     self.fn.run(
2025-05-07T20:33:33.8781691Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:33.8782202Z     kernel = self.compile(
2025-05-07T20:33:33.8782744Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:33.8783396Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:33.8783776Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:33.8783996Z 
2025-05-07T20:33:33.8784203Z self = <triton.compiler.compiler.ASTSource object at 0x...>
2025-05-07T20:33:33.8785309Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:33.8786656Z codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f89ca6f4180>}
2025-05-07T20:33:33.8788068Z module_map = {'triton.language.extra.libdevice': <module ...>}
2025-05-07T20:33:33.8789114Z context = <...>
2025-05-07T20:33:33.8789393Z 
2025-05-07T20:33:33.8789583Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:33.8790112Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:33.8790579Z                            module_map=module_map)
2025-05-07T20:33:33.8791056Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:33.8791403Z E       def _kernel_quantize_fp8_row(
2025-05-07T20:33:33.8791660Z E       ^
2025-05-07T20:33:33.8792126Z E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:33.8792571Z 
2025-05-07T20:33:33.8792995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
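Note the two distinct failure paths: with `scale_ub=1200.0` the test dies inside `fn()` compiling `_fbgemm_silu_mul_quant`, while the example above gets past `fn()` and dies in `ref_fn()` compiling `_kernel_quantize_fp8_row` via `triton_quantize_fp8_row`. Both are the same hardware limitation: this job ran on `linux.g5.4xlarge.nvidia.gpu`, whose NVIDIA A10G reports compute capability (8, 6), and Triton's `fp8e4nv` (float8 e4m3) lowering generally expects SM 8.9+ (Ada/Hopper); on this GPU only `fp8e5` and `fp8e4b15` are available, exactly as the ValueError says. A minimal sketch, assuming pytest and helper names of our own choosing (FBGEMM's actual gating may differ), of a capability gate that would turn these hard failures into skips:

```python
# Sketch of a hardware gate for fp8e4nv-dependent tests; the helper name,
# the (8, 9) threshold, and the marker are assumptions, not FBGEMM API.
import pytest
import torch


def supports_fp8e4nv() -> bool:
    """True if this GPU should compile Triton kernels that use fp8e4nv."""
    if not torch.cuda.is_available():
        return False
    # The A10G on this runner reports (8, 6), so make_ir aborts before any
    # kernel launches; Ada/Hopper parts report (8, 9) or (9, 0).
    return torch.cuda.get_device_capability() >= (8, 9)


# Stacked on top of the @given/@settings decorators shown in the traceback,
# this reports a clean skip instead of a CompilationError on pre-Ada GPUs.
requires_fp8e4nv = pytest.mark.skipif(
    not supports_fp8e4nv(),
    reason="Triton fp8e4nv needs SM 8.9+; this GPU only has fp8e5/fp8e4b15",
)
```

With `@requires_fp8e4nv` applied to `test_silu_mul_quant`, the remaining examples below would be reported as skips rather than repeated failures.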
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.8792571Z 2025-05-07T20:33:33.8792995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:33.8793497Z 2025-05-07T20:33:33.8793604Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:33.8794000Z self=, 2025-05-07T20:33:33.8794391Z T=1, 2025-05-07T20:33:33.8794614Z D=5120, 2025-05-07T20:33:33.8794794Z scale_ub=1200.0, 2025-05-07T20:33:33.8795010Z contiguous=False, 2025-05-07T20:33:33.8795227Z compiled=True, 2025-05-07T20:33:33.8795424Z ) 2025-05-07T20:33:33.8795734Z self = 2025-05-07T20:33:33.8796204Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:33.8796467Z 2025-05-07T20:33:33.8796546Z @given( 2025-05-07T20:33:33.8796764Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:33.8797062Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:33.8797355Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:33.8797667Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:33.8797987Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:33.8798259Z ) 2025-05-07T20:33:33.8798593Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:33.8799030Z def test_silu_mul_quant( 2025-05-07T20:33:33.8799265Z self, 2025-05-07T20:33:33.8799448Z T: int, 2025-05-07T20:33:33.8799650Z D: int, 2025-05-07T20:33:33.8799898Z scale_ub: Optional[float], 2025-05-07T20:33:33.8800165Z contiguous: bool, 2025-05-07T20:33:33.8800393Z compiled: bool, 2025-05-07T20:33:33.8800603Z ) -> None: 2025-05-07T20:33:33.8800804Z torch.manual_seed(2025) 2025-05-07T20:33:33.8801034Z 2025-05-07T20:33:33.8801293Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:33.8801623Z 2025-05-07T20:33:33.8801800Z x_sign = torch.sign(x) 2025-05-07T20:33:33.8802076Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:33.8802373Z x = x_sign * x_clamp 2025-05-07T20:33:33.8802597Z x0 = x[:, :D] 2025-05-07T20:33:33.8802803Z x1 = x[:, D:] 2025-05-07T20:33:33.8802998Z 2025-05-07T20:33:33.8803167Z if contiguous: 2025-05-07T20:33:33.8803385Z x0 = x0.contiguous() 2025-05-07T20:33:33.8803630Z x1 = x1.contiguous() 2025-05-07T20:33:33.8803855Z 2025-05-07T20:33:33.8804080Z if scale_ub is not None: 2025-05-07T20:33:33.8804347Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:33.8804665Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:33.8804969Z ) 2025-05-07T20:33:33.8805149Z else: 2025-05-07T20:33:33.8805345Z scale_ub_tensor = None 2025-05-07T20:33:33.8805585Z 2025-05-07T20:33:33.8805803Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:33.8806099Z op = silu_mul_quant 2025-05-07T20:33:33.8806333Z if compiled: 2025-05-07T20:33:33.8806566Z op = torch.compile(op) 2025-05-07T20:33:33.8806853Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.8807110Z 2025-05-07T20:33:33.8807294Z > y_fp8, y_scale = fn() 2025-05-07T20:33:33.8807455Z 2025-05-07T20:33:33.8807552Z moe/activation_test.py:117: 2025-05-07T20:33:33.8807835Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.8808152Z moe/activation_test.py:115: in fn 2025-05-07T20:33:33.8808515Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:33.8809077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:33.8809633Z return fn(*args, **kwargs) 
2025-05-07T20:33:33.8810278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:33.8810946Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:33.8811484Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:33.8812149Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:33.8812854Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:33.8813370Z kernel = self.compile( 2025-05-07T20:33:33.8813905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:33.8814539Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:33.8814921Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:33.8815139Z 2025-05-07T20:33:33.8815343Z self = 2025-05-07T20:33:33.8816393Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:33.8817734Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca6f5300>} 2025-05-07T20:33:33.8819051Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:33.8820103Z context = 2025-05-07T20:33:33.8820390Z 2025-05-07T20:33:33.8820551Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:33.8821062Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:33.8821513Z module_map=module_map) 2025-05-07T20:33:33.8821872Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:33.8822214Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:33.8822464Z E ^ 2025-05-07T20:33:33.8822916Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:33.8823373Z 2025-05-07T20:33:33.8823834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:34.0221761Z 2025-05-07T20:33:34.0222015Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:34.0222479Z self=, 2025-05-07T20:33:34.0222884Z T=1, 2025-05-07T20:33:34.0223121Z D=5120, 2025-05-07T20:33:34.0223319Z scale_ub=1200.0, 2025-05-07T20:33:34.0223540Z contiguous=False, 2025-05-07T20:33:34.0223774Z compiled=False, 2025-05-07T20:33:34.0223982Z ) 2025-05-07T20:33:34.0224293Z self = 2025-05-07T20:33:34.0224779Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:34.0225046Z 2025-05-07T20:33:34.0225138Z @given( 2025-05-07T20:33:34.0225364Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:34.0225672Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:34.0225981Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:34.0226464Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:34.0226786Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:34.0227073Z ) 2025-05-07T20:33:34.0227486Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:34.0227935Z def test_silu_mul_quant( 2025-05-07T20:33:34.0228180Z self, 2025-05-07T20:33:34.0228373Z T: int, 2025-05-07T20:33:34.0228563Z D: int, 2025-05-07T20:33:34.0228784Z scale_ub: Optional[float], 2025-05-07T20:33:34.0229054Z contiguous: bool, 2025-05-07T20:33:34.0229289Z compiled: bool, 2025-05-07T20:33:34.0229518Z ) -> None: 2025-05-07T20:33:34.0229764Z torch.manual_seed(2025) 2025-05-07T20:33:34.0230087Z 2025-05-07T20:33:34.0230354Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:34.0230694Z 2025-05-07T20:33:34.0230887Z x_sign = torch.sign(x) 2025-05-07T20:33:34.0231179Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:34.0231490Z x = x_sign * x_clamp 2025-05-07T20:33:34.0231729Z x0 = x[:, :D] 2025-05-07T20:33:34.0231944Z x1 = x[:, D:] 2025-05-07T20:33:34.0232156Z 2025-05-07T20:33:34.0232346Z if contiguous: 2025-05-07T20:33:34.0232578Z x0 = x0.contiguous() 2025-05-07T20:33:34.0232836Z x1 = x1.contiguous() 2025-05-07T20:33:34.0233083Z 2025-05-07T20:33:34.0233273Z if scale_ub is not None: 2025-05-07T20:33:34.0233548Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:34.0233883Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:34.0234193Z ) 2025-05-07T20:33:34.0234396Z else: 2025-05-07T20:33:34.0234616Z scale_ub_tensor = None 2025-05-07T20:33:34.0234868Z 2025-05-07T20:33:34.0235105Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:34.0235421Z op = silu_mul_quant 2025-05-07T20:33:34.0235678Z if compiled: 2025-05-07T20:33:34.0235931Z op = torch.compile(op) 2025-05-07T20:33:34.0236225Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.0236506Z 2025-05-07T20:33:34.0236697Z > y_fp8, y_scale = fn() 2025-05-07T20:33:34.0236864Z 2025-05-07T20:33:34.0236963Z moe/activation_test.py:117: 2025-05-07T20:33:34.0237259Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.0237582Z moe/activation_test.py:115: in fn 2025-05-07T20:33:34.0237876Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.0238560Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:34.0239245Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:34.0239866Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:34.0240736Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:34.0241403Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:34.0241919Z kernel = self.compile( 2025-05-07T20:33:34.0242463Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:34.0243113Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:34.0243503Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.0243724Z 2025-05-07T20:33:34.0243928Z self = 2025-05-07T20:33:34.0244996Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:34.0247105Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca6f6020>} 2025-05-07T20:33:34.0248471Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:34.0249473Z context = 2025-05-07T20:33:34.0249760Z 2025-05-07T20:33:34.0249924Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:34.0250435Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:34.0250961Z module_map=module_map) 2025-05-07T20:33:34.0251318Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:34.0251676Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:34.0251942Z E ^ 2025-05-07T20:33:34.0252400Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:34.0252855Z 2025-05-07T20:33:34.0253272Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:34.0253775Z 2025-05-07T20:33:34.0253875Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:34.0254289Z self=, 2025-05-07T20:33:34.0254685Z T=16384, 2025-05-07T20:33:34.0254882Z D=5120, 2025-05-07T20:33:34.0255077Z scale_ub=1200.0, 2025-05-07T20:33:34.0255306Z contiguous=False, 2025-05-07T20:33:34.0255537Z compiled=True, 2025-05-07T20:33:34.0255743Z ) 2025-05-07T20:33:34.0256056Z self = 2025-05-07T20:33:34.0256553Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:34.0256838Z 2025-05-07T20:33:34.0256917Z @given( 2025-05-07T20:33:34.0257143Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:34.0257449Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:34.0257751Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:34.0258082Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:34.0258405Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:34.0258693Z ) 2025-05-07T20:33:34.0259048Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:34.0259492Z def test_silu_mul_quant( 2025-05-07T20:33:34.0259735Z self, 2025-05-07T20:33:34.0259938Z T: int, 2025-05-07T20:33:34.0260134Z D: int, 2025-05-07T20:33:34.0260347Z scale_ub: Optional[float], 2025-05-07T20:33:34.0260688Z contiguous: bool, 2025-05-07T20:33:34.0260928Z compiled: bool, 2025-05-07T20:33:34.0261149Z ) -> None: 2025-05-07T20:33:34.0261363Z torch.manual_seed(2025) 2025-05-07T20:33:34.0261604Z 2025-05-07T20:33:34.0261865Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:34.0262209Z 2025-05-07T20:33:34.0262401Z x_sign = torch.sign(x) 2025-05-07T20:33:34.0262684Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:34.0262996Z x = x_sign * x_clamp 2025-05-07T20:33:34.0263236Z x0 = x[:, :D] 2025-05-07T20:33:34.0263451Z x1 = x[:, D:] 2025-05-07T20:33:34.0263661Z 2025-05-07T20:33:34.0263853Z if contiguous: 2025-05-07T20:33:34.0264080Z x0 = x0.contiguous() 2025-05-07T20:33:34.0264335Z x1 = x1.contiguous() 2025-05-07T20:33:34.0264576Z 2025-05-07T20:33:34.0264761Z if scale_ub is not None: 2025-05-07T20:33:34.0265032Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:34.0265364Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:34.0265779Z ) 2025-05-07T20:33:34.0265975Z else: 2025-05-07T20:33:34.0266184Z scale_ub_tensor = None 2025-05-07T20:33:34.0266434Z 2025-05-07T20:33:34.0266657Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:34.0266966Z op = silu_mul_quant 2025-05-07T20:33:34.0267217Z if compiled: 2025-05-07T20:33:34.0267523Z op = torch.compile(op) 2025-05-07T20:33:34.0267817Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.0268093Z 2025-05-07T20:33:34.0268282Z > y_fp8, y_scale = fn() 2025-05-07T20:33:34.0268447Z 2025-05-07T20:33:34.0268545Z moe/activation_test.py:117: 2025-05-07T20:33:34.0268842Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.0269216Z moe/activation_test.py:115: in fn 2025-05-07T20:33:34.0269495Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.0270069Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:34.0270634Z return fn(*args, **kwargs) 
2025-05-07T20:33:34.0271286Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:34.0271963Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:34.0272500Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:34.0273169Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:34.0273821Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:34.0274355Z kernel = self.compile( 2025-05-07T20:33:34.0274912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:34.0275583Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:34.0275975Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.0276207Z 2025-05-07T20:33:34.0276409Z self = 2025-05-07T20:33:34.0277484Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:34.0278836Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca6f7600>} 2025-05-07T20:33:34.0280249Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:34.0281256Z context = 2025-05-07T20:33:34.0281540Z 2025-05-07T20:33:34.0281709Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:34.0282226Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:34.0282686Z module_map=module_map) 2025-05-07T20:33:34.0283047Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:34.0283412Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:34.0283662Z E ^ 2025-05-07T20:33:34.0290045Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:34.0290492Z 2025-05-07T20:33:34.0290922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:34.0291427Z 2025-05-07T20:33:34.0291527Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:34.0292044Z self=, 2025-05-07T20:33:34.0292449Z T=2048, 2025-05-07T20:33:34.0292640Z D=7168, 2025-05-07T20:33:34.0292827Z scale_ub=1200.0, 2025-05-07T20:33:34.0293056Z contiguous=False, 2025-05-07T20:33:34.0293277Z compiled=True, 2025-05-07T20:33:34.2145433Z ) 2025-05-07T20:33:34.2145759Z self = 2025-05-07T20:33:34.2146290Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:34.2146650Z 2025-05-07T20:33:34.2146731Z @given( 2025-05-07T20:33:34.2146963Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:34.2147453Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:34.2147754Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:34.2148074Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:34.2148393Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:34.2148712Z ) 2025-05-07T20:33:34.2149054Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:34.2149501Z def test_silu_mul_quant( 2025-05-07T20:33:34.2149739Z self, 2025-05-07T20:33:34.2149929Z T: int, 2025-05-07T20:33:34.2150125Z D: int, 2025-05-07T20:33:34.2150336Z scale_ub: Optional[float], 2025-05-07T20:33:34.2150603Z contiguous: bool, 2025-05-07T20:33:34.2150835Z compiled: bool, 2025-05-07T20:33:34.2151047Z ) -> None: 2025-05-07T20:33:34.2151252Z torch.manual_seed(2025) 2025-05-07T20:33:34.2151486Z 2025-05-07T20:33:34.2151745Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:34.2152086Z 2025-05-07T20:33:34.2152273Z x_sign = torch.sign(x) 2025-05-07T20:33:34.2152554Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:34.2152859Z x = x_sign * x_clamp 2025-05-07T20:33:34.2153103Z x0 = x[:, :D] 2025-05-07T20:33:34.2153314Z x1 = x[:, D:] 2025-05-07T20:33:34.2153508Z 2025-05-07T20:33:34.2153684Z if contiguous: 2025-05-07T20:33:34.2153907Z x0 = x0.contiguous() 2025-05-07T20:33:34.2154153Z x1 = x1.contiguous() 2025-05-07T20:33:34.2154383Z 2025-05-07T20:33:34.2154561Z if scale_ub is not None: 2025-05-07T20:33:34.2154817Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:34.2155143Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:34.2155442Z ) 2025-05-07T20:33:34.2155623Z else: 2025-05-07T20:33:34.2155822Z scale_ub_tensor = None 2025-05-07T20:33:34.2156066Z 2025-05-07T20:33:34.2156288Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:34.2156590Z op = silu_mul_quant 2025-05-07T20:33:34.2156833Z if compiled: 2025-05-07T20:33:34.2157143Z op = torch.compile(op) 2025-05-07T20:33:34.2157442Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.2157708Z 2025-05-07T20:33:34.2157887Z > y_fp8, y_scale = fn() 2025-05-07T20:33:34.2158049Z 2025-05-07T20:33:34.2158145Z moe/activation_test.py:117: 2025-05-07T20:33:34.2158435Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.2158754Z moe/activation_test.py:115: in fn 2025-05-07T20:33:34.2159023Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.2159585Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:34.2160129Z return fn(*args, **kwargs) 
2025-05-07T20:33:34.2160768Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:34.2161437Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:34.2162026Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:34.2162765Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:34.2163406Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:34.2163919Z kernel = self.compile( 2025-05-07T20:33:34.2164450Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:34.2165086Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:34.2165472Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.2165694Z 2025-05-07T20:33:34.2165894Z self = 2025-05-07T20:33:34.2167002Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:34.2168374Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca038720>} 2025-05-07T20:33:34.2169719Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:34.2170720Z context = 2025-05-07T20:33:34.2171001Z 2025-05-07T20:33:34.2171161Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:34.2171671Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:34.2172134Z module_map=module_map) 2025-05-07T20:33:34.2172486Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:34.2172838Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:34.2173085Z E ^ 2025-05-07T20:33:34.2173540Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:34.2173987Z 2025-05-07T20:33:34.2174399Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:34.2174894Z 2025-05-07T20:33:34.2174997Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:34.2175389Z self=, 2025-05-07T20:33:34.2175777Z T=1, 2025-05-07T20:33:34.2175950Z D=5120, 2025-05-07T20:33:34.2176135Z scale_ub=None, 2025-05-07T20:33:34.2176342Z contiguous=False, 2025-05-07T20:33:34.2176564Z compiled=False, 2025-05-07T20:33:34.2176767Z ) 2025-05-07T20:33:34.2177126Z self = 2025-05-07T20:33:34.2177611Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:34.2177863Z 2025-05-07T20:33:34.2177945Z @given( 2025-05-07T20:33:34.2178162Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:34.2178469Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:34.2178766Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:34.2179080Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:34.2179400Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:34.2179677Z ) 2025-05-07T20:33:34.2180044Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:34.2180496Z def test_silu_mul_quant( 2025-05-07T20:33:34.2180735Z self, 2025-05-07T20:33:34.2180929Z T: int, 2025-05-07T20:33:34.2181118Z D: int, 2025-05-07T20:33:34.2181330Z scale_ub: Optional[float], 2025-05-07T20:33:34.2181590Z contiguous: bool, 2025-05-07T20:33:34.2181901Z compiled: bool, 2025-05-07T20:33:34.2182122Z ) -> None: 2025-05-07T20:33:34.2182333Z torch.manual_seed(2025) 2025-05-07T20:33:34.2182562Z 2025-05-07T20:33:34.2182819Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:34.2183148Z 2025-05-07T20:33:34.2183330Z x_sign = torch.sign(x) 2025-05-07T20:33:34.2183613Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:34.2183912Z x = x_sign * x_clamp 2025-05-07T20:33:34.2184139Z x0 = x[:, :D] 2025-05-07T20:33:34.2184345Z x1 = x[:, D:] 2025-05-07T20:33:34.2184545Z 2025-05-07T20:33:34.2184719Z if contiguous: 2025-05-07T20:33:34.2184944Z x0 = x0.contiguous() 2025-05-07T20:33:34.2185245Z x1 = x1.contiguous() 2025-05-07T20:33:34.2185474Z 2025-05-07T20:33:34.2185652Z if scale_ub is not None: 2025-05-07T20:33:34.2185919Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:34.2186249Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:34.2186547Z ) 2025-05-07T20:33:34.2186735Z else: 2025-05-07T20:33:34.2186938Z scale_ub_tensor = None 2025-05-07T20:33:34.2187179Z 2025-05-07T20:33:34.2187462Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:34.2187763Z op = silu_mul_quant 2025-05-07T20:33:34.2188000Z if compiled: 2025-05-07T20:33:34.2188238Z op = torch.compile(op) 2025-05-07T20:33:34.2188524Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.2188786Z 2025-05-07T20:33:34.2188970Z > y_fp8, y_scale = fn() 2025-05-07T20:33:34.2189129Z 2025-05-07T20:33:34.2189228Z moe/activation_test.py:117: 2025-05-07T20:33:34.2189516Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.2189834Z moe/activation_test.py:115: in fn 2025-05-07T20:33:34.2190127Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.2190847Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:34.2191514Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:34.2192054Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:34.2192718Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:34.2193362Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:34.2193873Z kernel = self.compile( 2025-05-07T20:33:34.2194418Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:34.2195054Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:34.2195511Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.2195741Z 2025-05-07T20:33:34.2195940Z self = 2025-05-07T20:33:34.2196993Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:34.2198330Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca039120>} 2025-05-07T20:33:34.2199638Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:34.2200638Z context = 2025-05-07T20:33:34.2200920Z 2025-05-07T20:33:34.2201158Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:34.2201665Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:34.2202121Z module_map=module_map) 2025-05-07T20:33:34.2202467Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:34.2202809Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:34.2203058Z E ^ 2025-05-07T20:33:34.2203503Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:34.2203960Z 2025-05-07T20:33:34.2204375Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:34.2204926Z 2025-05-07T20:33:34.2205027Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:34.2205434Z self=, 2025-05-07T20:33:34.2205830Z T=4096, 2025-05-07T20:33:34.2206021Z D=7168, 2025-05-07T20:33:34.2206214Z scale_ub=1200.0, 2025-05-07T20:33:34.2206432Z contiguous=False, 2025-05-07T20:33:34.2206658Z compiled=False, 2025-05-07T20:33:34.2206863Z ) 2025-05-07T20:33:34.2207181Z self = 2025-05-07T20:33:34.2207667Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:34.2207935Z 2025-05-07T20:33:34.2208020Z @given( 2025-05-07T20:33:34.2208239Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:34.2208544Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:34.2208846Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:34.2209172Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:34.2209489Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:34.2209772Z ) 2025-05-07T20:33:34.2210121Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:34.2210551Z def test_silu_mul_quant( 2025-05-07T20:33:34.2210796Z self, 2025-05-07T20:33:34.2210990Z T: int, 2025-05-07T20:33:34.2211181Z D: int, 2025-05-07T20:33:34.2211405Z scale_ub: Optional[float], 2025-05-07T20:33:34.2211674Z contiguous: bool, 2025-05-07T20:33:34.2211907Z compiled: bool, 2025-05-07T20:33:34.2212134Z ) -> None: 2025-05-07T20:33:34.2212343Z torch.manual_seed(2025) 2025-05-07T20:33:34.2212578Z 2025-05-07T20:33:34.2212853Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:34.2213191Z 2025-05-07T20:33:34.2213384Z x_sign = torch.sign(x) 2025-05-07T20:33:34.2213669Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:34.2213979Z x = x_sign * x_clamp 2025-05-07T20:33:34.2214214Z x0 = x[:, :D] 2025-05-07T20:33:34.2214493Z x1 = x[:, D:] 2025-05-07T20:33:34.2214705Z 2025-05-07T20:33:34.2214891Z if contiguous: 2025-05-07T20:33:34.2215110Z x0 = x0.contiguous() 2025-05-07T20:33:34.2215363Z x1 = x1.contiguous() 2025-05-07T20:33:34.2215599Z 2025-05-07T20:33:34.2215789Z if scale_ub is not None: 2025-05-07T20:33:34.2216063Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:34.2216393Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:34.2216689Z ) 2025-05-07T20:33:34.2216884Z else: 2025-05-07T20:33:34.2217093Z scale_ub_tensor = None 2025-05-07T20:33:34.2217335Z 2025-05-07T20:33:34.2217564Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:34.2217874Z op = silu_mul_quant 2025-05-07T20:33:34.2218132Z if compiled: 2025-05-07T20:33:34.2218375Z op = torch.compile(op) 2025-05-07T20:33:34.2218666Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.2218944Z 2025-05-07T20:33:34.2219217Z > y_fp8, y_scale = fn() 2025-05-07T20:33:34.2219385Z 2025-05-07T20:33:34.2219482Z moe/activation_test.py:117: 2025-05-07T20:33:34.2219772Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.2220093Z moe/activation_test.py:115: in fn 2025-05-07T20:33:34.2220375Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.2221059Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:34.2221733Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:34.2222280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:34.2222944Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:34.2223642Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:34.2224158Z kernel = self.compile( 2025-05-07T20:33:34.2224696Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:34.2225334Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:34.2225728Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.2225947Z 2025-05-07T20:33:34.2226148Z self = 2025-05-07T20:33:34.2227197Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:34.2228595Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca03a480>} 2025-05-07T20:33:34.2229953Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:34.2231005Z context = 2025-05-07T20:33:34.2231284Z 2025-05-07T20:33:34.2231446Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:34.2231964Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:34.2232435Z module_map=module_map) 2025-05-07T20:33:34.2232792Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:34.2233141Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:34.2233398Z E ^ 2025-05-07T20:33:34.2233902Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:34.2234351Z 2025-05-07T20:33:34.2234767Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:34.3780685Z 2025-05-07T20:33:34.3780890Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:34.3781315Z self=, 2025-05-07T20:33:34.3781759Z T=16384, 2025-05-07T20:33:34.3781953Z D=7168, 2025-05-07T20:33:34.3782138Z scale_ub=None, 2025-05-07T20:33:34.3782349Z contiguous=True, 2025-05-07T20:33:34.3782568Z compiled=True, 2025-05-07T20:33:34.3782765Z ) 2025-05-07T20:33:34.3783076Z self = 2025-05-07T20:33:34.3783555Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:34.3783821Z 2025-05-07T20:33:34.3783905Z @given( 2025-05-07T20:33:34.3784126Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:34.3784432Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:34.3784882Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:34.3785204Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:34.3785522Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:34.3785804Z ) 2025-05-07T20:33:34.3786137Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:34.3786583Z def test_silu_mul_quant( 2025-05-07T20:33:34.3786820Z self, 2025-05-07T20:33:34.3787010Z T: int, 2025-05-07T20:33:34.3787200Z D: int, 2025-05-07T20:33:34.3787499Z scale_ub: Optional[float], 2025-05-07T20:33:34.3787771Z contiguous: bool, 2025-05-07T20:33:34.3787997Z compiled: bool, 2025-05-07T20:33:34.3788291Z ) -> None: 2025-05-07T20:33:34.3788497Z torch.manual_seed(2025) 2025-05-07T20:33:34.3788727Z 2025-05-07T20:33:34.3788991Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:34.3789331Z 2025-05-07T20:33:34.3789516Z x_sign = torch.sign(x) 2025-05-07T20:33:34.3789801Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:34.3790112Z x = x_sign * x_clamp 2025-05-07T20:33:34.3790346Z x0 = x[:, :D] 2025-05-07T20:33:34.3790559Z x1 = x[:, D:] 2025-05-07T20:33:34.3790759Z 2025-05-07T20:33:34.3790935Z if contiguous: 2025-05-07T20:33:34.3791161Z x0 = x0.contiguous() 2025-05-07T20:33:34.3791414Z x1 = x1.contiguous() 2025-05-07T20:33:34.3791646Z 2025-05-07T20:33:34.3791829Z if scale_ub is not None: 2025-05-07T20:33:34.3792099Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:34.3792431Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:34.3792730Z ) 2025-05-07T20:33:34.3792923Z else: 2025-05-07T20:33:34.3793129Z scale_ub_tensor = None 2025-05-07T20:33:34.3793377Z 2025-05-07T20:33:34.3793607Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:34.3793923Z op = silu_mul_quant 2025-05-07T20:33:34.3794162Z if compiled: 2025-05-07T20:33:34.3794411Z op = torch.compile(op) 2025-05-07T20:33:34.3794699Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.3794966Z 2025-05-07T20:33:34.3795159Z > y_fp8, y_scale = fn() 2025-05-07T20:33:34.3795318Z 2025-05-07T20:33:34.3795418Z moe/activation_test.py:117: 2025-05-07T20:33:34.3795704Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.3796026Z moe/activation_test.py:115: in fn 2025-05-07T20:33:34.3796299Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.3796855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:34.3797399Z return fn(*args, **kwargs) 
2025-05-07T20:33:34.3798120Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:34.3798794Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:34.3799333Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:34.3799994Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:34.3800644Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:34.3801168Z kernel = self.compile( 2025-05-07T20:33:34.3801712Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:34.3802353Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:34.3802752Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.3802977Z 2025-05-07T20:33:34.3803187Z self = 2025-05-07T20:33:34.3804359Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:34.3805704Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca03b740>} 2025-05-07T20:33:34.3807012Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:34.3808078Z context = 2025-05-07T20:33:34.3808357Z 2025-05-07T20:33:34.3808522Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:34.3809032Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:34.3809496Z module_map=module_map) 2025-05-07T20:33:34.3809856Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:34.3810195Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:34.3810462Z E ^ 2025-05-07T20:33:34.3810929Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:34.3811388Z 2025-05-07T20:33:34.3811806Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:34.3812303Z 2025-05-07T20:33:34.3812404Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:34.3812810Z self=, 2025-05-07T20:33:34.3813197Z T=4096, 2025-05-07T20:33:34.3813376Z D=5120, 2025-05-07T20:33:34.3813564Z scale_ub=None, 2025-05-07T20:33:34.3813780Z contiguous=False, 2025-05-07T20:33:34.3814000Z compiled=True, 2025-05-07T20:33:34.3814200Z ) 2025-05-07T20:33:34.3814514Z self = 2025-05-07T20:33:34.3814990Z T = 4096, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:34.3815254Z 2025-05-07T20:33:34.3815332Z @given( 2025-05-07T20:33:34.3815556Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:34.3815855Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:34.3816148Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:34.3816468Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:34.3816786Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:34.3817071Z ) 2025-05-07T20:33:34.3817419Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:34.3817894Z def test_silu_mul_quant( 2025-05-07T20:33:34.3818136Z self, 2025-05-07T20:33:34.3818329Z T: int, 2025-05-07T20:33:34.3818517Z D: int, 2025-05-07T20:33:34.3818730Z scale_ub: Optional[float], 2025-05-07T20:33:34.3818988Z contiguous: bool, 2025-05-07T20:33:34.3819225Z compiled: bool, 2025-05-07T20:33:34.3819441Z ) -> None: 2025-05-07T20:33:34.3819647Z torch.manual_seed(2025) 2025-05-07T20:33:34.3819883Z 2025-05-07T20:33:34.3820143Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:34.3820474Z 2025-05-07T20:33:34.3820659Z x_sign = torch.sign(x) 2025-05-07T20:33:34.3820943Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:34.3821241Z x = x_sign * x_clamp 2025-05-07T20:33:34.3821479Z x0 = x[:, :D] 2025-05-07T20:33:34.3821698Z x1 = x[:, D:] 2025-05-07T20:33:34.3821896Z 2025-05-07T20:33:34.3822076Z if contiguous: 2025-05-07T20:33:34.3822306Z x0 = x0.contiguous() 2025-05-07T20:33:34.3822638Z x1 = x1.contiguous() 2025-05-07T20:33:34.3822871Z 2025-05-07T20:33:34.3823059Z if scale_ub is not None: 2025-05-07T20:33:34.3823317Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:34.3823643Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:34.3823943Z ) 2025-05-07T20:33:34.3824129Z else: 2025-05-07T20:33:34.3824327Z scale_ub_tensor = None 2025-05-07T20:33:34.3824580Z 2025-05-07T20:33:34.3830211Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:34.3830560Z op = silu_mul_quant 2025-05-07T20:33:34.3830805Z if compiled: 2025-05-07T20:33:34.3831052Z op = torch.compile(op) 2025-05-07T20:33:34.3831416Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.3831683Z 2025-05-07T20:33:34.3831871Z > y_fp8, y_scale = fn() 2025-05-07T20:33:34.3832038Z 2025-05-07T20:33:34.3832137Z moe/activation_test.py:117: 2025-05-07T20:33:34.3832434Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.3832762Z moe/activation_test.py:115: in fn 2025-05-07T20:33:34.3833037Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.3833593Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:34.3834141Z return fn(*args, **kwargs) 
2025-05-07T20:33:34.3834793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:34.3835458Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:34.3835988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:34.3836660Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:34.3837311Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:34.3837831Z kernel = self.compile( 2025-05-07T20:33:34.3838378Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:34.3839024Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:34.3839404Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.3839626Z 2025-05-07T20:33:34.3839828Z self = 2025-05-07T20:33:34.3841191Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:34.3842628Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca254c20>} 2025-05-07T20:33:34.3844293Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:34.3845533Z context = 2025-05-07T20:33:34.3845871Z 2025-05-07T20:33:34.3846053Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:34.3846658Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:34.3847197Z module_map=module_map) 2025-05-07T20:33:34.3847594Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:34.3847990Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:34.3848271Z E ^ 2025-05-07T20:33:34.3848804Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:34.3849378Z 2025-05-07T20:33:34.3849808Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:34.5209480Z 2025-05-07T20:33:34.5209959Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:34.5210369Z self=, 2025-05-07T20:33:34.5210780Z T=4096, 2025-05-07T20:33:34.5210968Z D=5120, 2025-05-07T20:33:34.5211153Z scale_ub=1200.0, 2025-05-07T20:33:34.5211377Z contiguous=False, 2025-05-07T20:33:34.5211607Z compiled=False, 2025-05-07T20:33:34.5211805Z ) 2025-05-07T20:33:34.5212121Z self = 2025-05-07T20:33:34.5212706Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:34.5212983Z 2025-05-07T20:33:34.5213068Z @given( 2025-05-07T20:33:34.5213292Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:34.5213617Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:34.5213933Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:34.5214253Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:34.5214572Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:34.5214855Z ) 2025-05-07T20:33:34.5215203Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:34.5215649Z def test_silu_mul_quant( 2025-05-07T20:33:34.5215884Z self, 2025-05-07T20:33:34.5216070Z T: int, 2025-05-07T20:33:34.5216261Z D: int, 2025-05-07T20:33:34.5216481Z scale_ub: Optional[float], 2025-05-07T20:33:34.5216749Z contiguous: bool, 2025-05-07T20:33:34.5216987Z compiled: bool, 2025-05-07T20:33:34.5217210Z ) -> None: 2025-05-07T20:33:34.5217418Z torch.manual_seed(2025) 2025-05-07T20:33:34.5217654Z 2025-05-07T20:33:34.5217922Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:34.5218258Z 2025-05-07T20:33:34.5218441Z x_sign = torch.sign(x) 2025-05-07T20:33:34.5218728Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:34.5219062Z x = x_sign * x_clamp 2025-05-07T20:33:34.5219296Z x0 = x[:, :D] 2025-05-07T20:33:34.5219505Z x1 = x[:, D:] 2025-05-07T20:33:34.5219703Z 2025-05-07T20:33:34.5219882Z if contiguous: 2025-05-07T20:33:34.5220111Z x0 = x0.contiguous() 2025-05-07T20:33:34.5220356Z x1 = x1.contiguous() 2025-05-07T20:33:34.5220593Z 2025-05-07T20:33:34.5220777Z if scale_ub is not None: 2025-05-07T20:33:34.5221038Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:34.5221372Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:34.5221683Z ) 2025-05-07T20:33:34.5221873Z else: 2025-05-07T20:33:34.5222147Z scale_ub_tensor = None 2025-05-07T20:33:34.5222398Z 2025-05-07T20:33:34.5222620Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:34.5222922Z op = silu_mul_quant 2025-05-07T20:33:34.5223166Z if compiled: 2025-05-07T20:33:34.5223409Z op = torch.compile(op) 2025-05-07T20:33:34.5223697Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.5223968Z 2025-05-07T20:33:34.5224158Z > y_fp8, y_scale = fn() 2025-05-07T20:33:34.5224322Z 2025-05-07T20:33:34.5224419Z moe/activation_test.py:117: 2025-05-07T20:33:34.5224710Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.5225038Z moe/activation_test.py:115: in fn 2025-05-07T20:33:34.5225306Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:34.5225988Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:34.5209959Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:34.5210369Z     self=<...>,
2025-05-07T20:33:34.5210780Z     T=4096,
2025-05-07T20:33:34.5210968Z     D=5120,
2025-05-07T20:33:34.5211153Z     scale_ub=1200.0,
2025-05-07T20:33:34.5211377Z     contiguous=False,
2025-05-07T20:33:34.5211607Z     compiled=False,
2025-05-07T20:33:34.5211805Z )
2025-05-07T20:33:34.5212121Z self = <...>
2025-05-07T20:33:34.5212706Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False
2025-05-07T20:33:34.5212983Z 
2025-05-07T20:33:34.5213068Z     @given(
2025-05-07T20:33:34.5213292Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:34.5213617Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:34.5213933Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:34.5214253Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:34.5214572Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:34.5214855Z     )
2025-05-07T20:33:34.5215203Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:34.5215649Z     def test_silu_mul_quant(
2025-05-07T20:33:34.5215884Z         self,
2025-05-07T20:33:34.5216070Z         T: int,
2025-05-07T20:33:34.5216261Z         D: int,
2025-05-07T20:33:34.5216481Z         scale_ub: Optional[float],
2025-05-07T20:33:34.5216749Z         contiguous: bool,
2025-05-07T20:33:34.5216987Z         compiled: bool,
2025-05-07T20:33:34.5217210Z     ) -> None:
2025-05-07T20:33:34.5217418Z         torch.manual_seed(2025)
2025-05-07T20:33:34.5217654Z 
2025-05-07T20:33:34.5217922Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:34.5218258Z 
2025-05-07T20:33:34.5218441Z         x_sign = torch.sign(x)
2025-05-07T20:33:34.5218728Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:34.5219062Z         x = x_sign * x_clamp
2025-05-07T20:33:34.5219296Z         x0 = x[:, :D]
2025-05-07T20:33:34.5219505Z         x1 = x[:, D:]
2025-05-07T20:33:34.5219703Z 
2025-05-07T20:33:34.5219882Z         if contiguous:
2025-05-07T20:33:34.5220111Z             x0 = x0.contiguous()
2025-05-07T20:33:34.5220356Z             x1 = x1.contiguous()
2025-05-07T20:33:34.5220593Z 
2025-05-07T20:33:34.5220777Z         if scale_ub is not None:
2025-05-07T20:33:34.5221038Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:34.5221372Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:34.5221683Z             )
2025-05-07T20:33:34.5221873Z         else:
2025-05-07T20:33:34.5222147Z             scale_ub_tensor = None
2025-05-07T20:33:34.5222398Z 
2025-05-07T20:33:34.5222620Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:34.5222922Z             op = silu_mul_quant
2025-05-07T20:33:34.5223166Z             if compiled:
2025-05-07T20:33:34.5223409Z                 op = torch.compile(op)
2025-05-07T20:33:34.5223697Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:34.5223968Z 
2025-05-07T20:33:34.5224158Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:34.5224322Z 
2025-05-07T20:33:34.5224419Z moe/activation_test.py:117:
2025-05-07T20:33:34.5224710Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
2025-05-07T20:33:34.5225038Z moe/activation_test.py:115: in fn
2025-05-07T20:33:34.5225306Z     return op(x0, x1, scale_ub_tensor)
[... Triton compile traceback identical to the previous example ...]
2025-05-07T20:33:34.5237868Z E   triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:34.5238204Z E   def _fbgemm_silu_mul_quant(
2025-05-07T20:33:34.5238457Z E   ^
2025-05-07T20:33:34.5238915Z E   ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:34.5239783Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:34.5240727Z Trying example: test_silu_mul_quant(T=4096, D=5120, scale_ub=1200.0, contiguous=False, compiled=True) ... same CompilationError: fp8e4nv not supported in this architecture
2025-05-07T20:33:34.5272606Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=False) ... same CompilationError: fp8e4nv not supported in this architecture
2025-05-07T20:33:34.7260999Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=False) ... same CompilationError: fp8e4nv not supported in this architecture
2025-05-07T20:33:34.7291334Z Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=True) ... same CompilationError: fp8e4nv not supported in this architecture
2025-05-07T20:33:34.8661521Z Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=False, compiled=False) ... same CompilationError: fp8e4nv not supported in this architecture
2025-05-07T20:33:34.8695235Z Trying example: test_silu_mul_quant(T=2048, D=7168, scale_ub=None, contiguous=False, compiled=True) ... same CompilationError: fp8e4nv not supported in this architecture
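Because every sampled (T, D, scale_ub, contiguous, compiled) combination fails with the same error, the failure is a property of the hardware, not of the inputs, so each additional example adds no information. A hedged sketch of how a test like this could skip cleanly on unsupported GPUs instead of failing once per example; the class and test names below are illustrative, not the actual fbgemm_gpu test code:

    import unittest

    import torch

    def lacks_fp8e4nv_support() -> bool:
        # True when the current CUDA device cannot compile Triton fp8e4nv
        # kernels; those require compute capability 8.9+ (Ada/Hopper).
        return (
            not torch.cuda.is_available()
            or torch.cuda.get_device_capability() < (8, 9)
        )

    class ActivationTestsSketch(unittest.TestCase):
        @unittest.skipIf(
            lacks_fp8e4nv_support(),
            "fp8e4nv (float8_e4m3fn) needs SM 8.9+; skipping on this GPU",
        )
        def test_silu_mul_quant_smoke(self) -> None:
            # Placeholder body: the Hypothesis-driven test shown in the log
            # above would run here only on fp8-capable hardware.
            self.assertGreaterEqual(torch.cuda.get_device_capability(), (8, 9))

    if __name__ == "__main__":
        unittest.main()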
2025-05-07T20:33:34.8712433Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:34.8713104Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:34.8713634Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:34.8714300Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:34.8714999Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:34.8715521Z kernel = self.compile( 2025-05-07T20:33:34.8716062Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:34.8716696Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:34.8717076Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:34.8717301Z 2025-05-07T20:33:34.8717498Z self = 2025-05-07T20:33:34.8718548Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:34.8719905Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca1842c0>} 2025-05-07T20:33:34.8721337Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:34.8722327Z context = 2025-05-07T20:33:34.8722616Z 2025-05-07T20:33:34.8722777Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:34.8723298Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:34.8723768Z module_map=module_map) 2025-05-07T20:33:34.8724128Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:34.8724479Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:34.8724785Z E ^ 2025-05-07T20:33:34.8725257Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:34.8725704Z 2025-05-07T20:33:34.8726137Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:34.8726647Z 2025-05-07T20:33:34.8726748Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:34.8727150Z self=, 2025-05-07T20:33:34.8727537Z T=4096, 2025-05-07T20:33:34.8727731Z D=7168, 2025-05-07T20:33:34.8727924Z scale_ub=None, 2025-05-07T20:33:34.8728138Z contiguous=False, 2025-05-07T20:33:34.8728366Z compiled=True, 2025-05-07T20:33:35.2843886Z ) 2025-05-07T20:33:35.2844464Z self = 2025-05-07T20:33:35.2845159Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:35.2845551Z 2025-05-07T20:33:35.2845666Z @given( 2025-05-07T20:33:35.2845951Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:35.2846268Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:35.2846587Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:35.2846917Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:35.2847237Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:35.2847522Z ) 2025-05-07T20:33:35.2847867Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:35.2848322Z def test_silu_mul_quant( 2025-05-07T20:33:35.2848565Z self, 2025-05-07T20:33:35.2848761Z T: int, 2025-05-07T20:33:35.2848952Z D: int, 2025-05-07T20:33:35.2849178Z scale_ub: Optional[float], 2025-05-07T20:33:35.2849454Z contiguous: bool, 2025-05-07T20:33:35.2849691Z compiled: bool, 2025-05-07T20:33:35.2849928Z ) -> None: 2025-05-07T20:33:35.2850148Z torch.manual_seed(2025) 2025-05-07T20:33:35.2850386Z 2025-05-07T20:33:35.2850945Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:35.2851298Z 2025-05-07T20:33:35.2851493Z x_sign = torch.sign(x) 2025-05-07T20:33:35.2851790Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:35.2852104Z x = x_sign * x_clamp 2025-05-07T20:33:35.2852352Z x0 = x[:, :D] 2025-05-07T20:33:35.2852561Z x1 = x[:, D:] 2025-05-07T20:33:35.2852774Z 2025-05-07T20:33:35.2852970Z if contiguous: 2025-05-07T20:33:35.2853196Z x0 = x0.contiguous() 2025-05-07T20:33:35.2853460Z x1 = x1.contiguous() 2025-05-07T20:33:35.2853703Z 2025-05-07T20:33:35.2853889Z if scale_ub is not None: 2025-05-07T20:33:35.2854170Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:35.2854515Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:35.2854833Z ) 2025-05-07T20:33:35.2855034Z else: 2025-05-07T20:33:35.2855254Z scale_ub_tensor = None 2025-05-07T20:33:35.2855501Z 2025-05-07T20:33:35.2855835Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:35.2856235Z op = silu_mul_quant 2025-05-07T20:33:35.2856484Z if compiled: 2025-05-07T20:33:35.2856741Z op = torch.compile(op) 2025-05-07T20:33:35.2857040Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:35.2857318Z 2025-05-07T20:33:35.2857502Z > y_fp8, y_scale = fn() 2025-05-07T20:33:35.2857675Z 2025-05-07T20:33:35.2857776Z moe/activation_test.py:117: 2025-05-07T20:33:35.2858076Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:35.2858399Z moe/activation_test.py:115: in fn 2025-05-07T20:33:35.2858687Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:35.2859267Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:35.2859949Z return fn(*args, **kwargs) 
2025-05-07T20:33:35.2860630Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:35.2861312Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:35.2861851Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:35.2862531Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:35.2863197Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:35.2863725Z kernel = self.compile( 2025-05-07T20:33:35.2864278Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:35.2864921Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:35.2865322Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:35.2865545Z 2025-05-07T20:33:35.2865765Z self = 2025-05-07T20:33:35.2866837Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:35.2868373Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca184d60>} 2025-05-07T20:33:35.2869758Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:35.2870774Z context = 2025-05-07T20:33:35.2871057Z 2025-05-07T20:33:35.2871286Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:35.2871802Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:35.2872274Z module_map=module_map) 2025-05-07T20:33:35.2872644Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:35.2872997Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:35.2873248Z E ^ 2025-05-07T20:33:35.2873710Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:35.2874172Z 2025-05-07T20:33:35.2874613Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:35.2875154Z 2025-05-07T20:33:35.2875259Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:35.2875675Z self=, 2025-05-07T20:33:35.2876076Z T=16384, 2025-05-07T20:33:35.2876279Z D=5120, 2025-05-07T20:33:35.2876468Z scale_ub=1200.0, 2025-05-07T20:33:35.2876790Z contiguous=False, 2025-05-07T20:33:35.2877021Z compiled=False, 2025-05-07T20:33:35.2877223Z ) 2025-05-07T20:33:35.2877548Z self = 2025-05-07T20:33:35.2878046Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:35.2878321Z 2025-05-07T20:33:35.2878400Z @given( 2025-05-07T20:33:35.2878632Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:35.2878953Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:35.2879251Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:35.2879580Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:35.2879937Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:35.2880298Z ) 2025-05-07T20:33:35.2880646Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:35.2881089Z def test_silu_mul_quant( 2025-05-07T20:33:35.2881339Z self, 2025-05-07T20:33:35.2881569Z T: int, 2025-05-07T20:33:35.2881768Z D: int, 2025-05-07T20:33:35.2881979Z scale_ub: Optional[float], 2025-05-07T20:33:35.2882251Z contiguous: bool, 2025-05-07T20:33:35.2882493Z compiled: bool, 2025-05-07T20:33:35.2882716Z ) -> None: 2025-05-07T20:33:35.2882929Z torch.manual_seed(2025) 2025-05-07T20:33:35.2883170Z 2025-05-07T20:33:35.2883439Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:35.2883783Z 2025-05-07T20:33:35.2883980Z x_sign = torch.sign(x) 2025-05-07T20:33:35.2884265Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:35.2884578Z x = x_sign * x_clamp 2025-05-07T20:33:35.2884822Z x0 = x[:, :D] 2025-05-07T20:33:35.2885030Z x1 = x[:, D:] 2025-05-07T20:33:35.2885238Z 2025-05-07T20:33:35.2885447Z if contiguous: 2025-05-07T20:33:35.2885687Z x0 = x0.contiguous() 2025-05-07T20:33:35.2885943Z x1 = x1.contiguous() 2025-05-07T20:33:35.2886182Z 2025-05-07T20:33:35.2886380Z if scale_ub is not None: 2025-05-07T20:33:35.2894433Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:35.2894787Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:35.2895112Z ) 2025-05-07T20:33:35.2895303Z else: 2025-05-07T20:33:35.2895513Z scale_ub_tensor = None 2025-05-07T20:33:35.2895766Z 2025-05-07T20:33:35.2895994Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:35.2896312Z op = silu_mul_quant 2025-05-07T20:33:35.2896565Z if compiled: 2025-05-07T20:33:35.2896808Z op = torch.compile(op) 2025-05-07T20:33:35.2897113Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:35.2897395Z 2025-05-07T20:33:35.2897586Z > y_fp8, y_scale = fn() 2025-05-07T20:33:35.2897837Z 2025-05-07T20:33:35.2897939Z moe/activation_test.py:117: 2025-05-07T20:33:35.2898243Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:35.2898582Z moe/activation_test.py:115: in fn 2025-05-07T20:33:35.2898861Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:35.2899558Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:35.2900246Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:35.2900793Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:35.2901477Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:35.2902145Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:35.2902674Z kernel = self.compile( 2025-05-07T20:33:35.2903281Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:35.2903972Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:35.2904373Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:35.2904597Z 2025-05-07T20:33:35.2904802Z self = 2025-05-07T20:33:35.2905873Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:35.2907242Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca185c60>} 2025-05-07T20:33:35.2908753Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:35.2909821Z context = 2025-05-07T20:33:35.2910145Z 2025-05-07T20:33:35.2910311Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:35.2910825Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:35.2911289Z module_map=module_map) 2025-05-07T20:33:35.2911659Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:35.2912002Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:35.2912259Z E ^ 2025-05-07T20:33:35.2912721Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:35.2913178Z 2025-05-07T20:33:35.2913591Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:35.2914108Z 2025-05-07T20:33:35.2914209Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:35.2914614Z self=, 2025-05-07T20:33:35.2915019Z T=16384, 2025-05-07T20:33:35.2915204Z D=5120, 2025-05-07T20:33:35.2915401Z scale_ub=1200.0, 2025-05-07T20:33:35.2915626Z contiguous=True, 2025-05-07T20:33:35.2915839Z compiled=True, 2025-05-07T20:33:35.2916047Z ) 2025-05-07T20:33:35.2916374Z self = 2025-05-07T20:33:35.2916856Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:35.2917134Z 2025-05-07T20:33:35.2917213Z @given( 2025-05-07T20:33:35.2917447Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:35.2917762Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:35.2918114Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:35.2918449Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:35.2918772Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:35.2919053Z ) 2025-05-07T20:33:35.2919400Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:35.2919852Z def test_silu_mul_quant( 2025-05-07T20:33:35.2920090Z self, 2025-05-07T20:33:35.2920294Z T: int, 2025-05-07T20:33:35.2920494Z D: int, 2025-05-07T20:33:35.2920711Z scale_ub: Optional[float], 2025-05-07T20:33:35.2920986Z contiguous: bool, 2025-05-07T20:33:35.2921234Z compiled: bool, 2025-05-07T20:33:35.2921454Z ) -> None: 2025-05-07T20:33:35.2921672Z torch.manual_seed(2025) 2025-05-07T20:33:35.2921919Z 2025-05-07T20:33:35.2922189Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:35.2922530Z 2025-05-07T20:33:35.2922729Z x_sign = torch.sign(x) 2025-05-07T20:33:35.2923103Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:35.2923452Z x = x_sign * x_clamp 2025-05-07T20:33:35.2923697Z x0 = x[:, :D] 2025-05-07T20:33:35.2923917Z x1 = x[:, D:] 2025-05-07T20:33:35.2924121Z 2025-05-07T20:33:35.2924308Z if contiguous: 2025-05-07T20:33:35.2924545Z x0 = x0.contiguous() 2025-05-07T20:33:35.2924796Z x1 = x1.contiguous() 2025-05-07T20:33:35.2925041Z 2025-05-07T20:33:35.2925233Z if scale_ub is not None: 2025-05-07T20:33:35.2925504Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:35.2925840Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:35.2926141Z ) 2025-05-07T20:33:35.2926330Z else: 2025-05-07T20:33:35.2926591Z scale_ub_tensor = None 2025-05-07T20:33:35.2926843Z 2025-05-07T20:33:35.2927063Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:35.2927375Z op = silu_mul_quant 2025-05-07T20:33:35.2927634Z if compiled: 2025-05-07T20:33:35.2927883Z op = torch.compile(op) 2025-05-07T20:33:35.2928171Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:35.2928445Z 2025-05-07T20:33:35.2928638Z > y_fp8, y_scale = fn() 2025-05-07T20:33:35.2928801Z 2025-05-07T20:33:35.2928899Z moe/activation_test.py:117: 2025-05-07T20:33:35.2929195Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:35.2929530Z moe/activation_test.py:115: in fn 2025-05-07T20:33:35.2929805Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:35.2930387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:35.2930949Z return fn(*args, **kwargs) 
2025-05-07T20:33:35.2931641Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:35.2932313Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:35.2932876Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:35.2933552Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:35.2934205Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:35.2934731Z kernel = self.compile( 2025-05-07T20:33:35.2935293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:35.2935943Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:35.2936331Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:35.2936569Z 2025-05-07T20:33:35.2936773Z self = 2025-05-07T20:33:35.2937891Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:35.2939259Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca187380>} 2025-05-07T20:33:35.2940922Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:35.2941936Z context = 2025-05-07T20:33:35.2942232Z 2025-05-07T20:33:35.2942396Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:35.2942927Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:35.2943563Z module_map=module_map) 2025-05-07T20:33:35.2943934Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:35.2944291Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:35.2944557Z E ^ 2025-05-07T20:33:35.2945015Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:35.2945479Z 2025-05-07T20:33:35.2945905Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:35.4495286Z 2025-05-07T20:33:35.4495900Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:35.4497113Z self=, 2025-05-07T20:33:35.4498722Z T=16384, 2025-05-07T20:33:35.4499243Z D=5120, 2025-05-07T20:33:35.4499735Z scale_ub=None, 2025-05-07T20:33:35.4500092Z contiguous=False, 2025-05-07T20:33:35.4500389Z compiled=True, 2025-05-07T20:33:35.4500637Z ) 2025-05-07T20:33:35.4500960Z self = 2025-05-07T20:33:35.4501459Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:35.4501738Z 2025-05-07T20:33:35.4501842Z @given( 2025-05-07T20:33:35.4502073Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:35.4502390Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:35.4502688Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:35.4503019Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:35.4503346Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:35.4503629Z ) 2025-05-07T20:33:35.4503973Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:35.4504419Z def test_silu_mul_quant( 2025-05-07T20:33:35.4504659Z self, 2025-05-07T20:33:35.4504853Z T: int, 2025-05-07T20:33:35.4505054Z D: int, 2025-05-07T20:33:35.4505280Z scale_ub: Optional[float], 2025-05-07T20:33:35.4505545Z contiguous: bool, 2025-05-07T20:33:35.4505787Z compiled: bool, 2025-05-07T20:33:35.4506017Z ) -> None: 2025-05-07T20:33:35.4506230Z torch.manual_seed(2025) 2025-05-07T20:33:35.4506469Z 2025-05-07T20:33:35.4506747Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:35.4507079Z 2025-05-07T20:33:35.4507279Z x_sign = torch.sign(x) 2025-05-07T20:33:35.4507657Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:35.4507972Z x = x_sign * x_clamp 2025-05-07T20:33:35.4508206Z x0 = x[:, :D] 2025-05-07T20:33:35.4508428Z x1 = x[:, D:] 2025-05-07T20:33:35.4508640Z 2025-05-07T20:33:35.4508819Z if contiguous: 2025-05-07T20:33:35.4509054Z x0 = x0.contiguous() 2025-05-07T20:33:35.4509314Z x1 = x1.contiguous() 2025-05-07T20:33:35.4509649Z 2025-05-07T20:33:35.4509851Z if scale_ub is not None: 2025-05-07T20:33:35.4510132Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:35.4510464Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:35.4510774Z ) 2025-05-07T20:33:35.4510972Z else: 2025-05-07T20:33:35.4511179Z scale_ub_tensor = None 2025-05-07T20:33:35.4511434Z 2025-05-07T20:33:35.4511671Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:35.4511980Z op = silu_mul_quant 2025-05-07T20:33:35.4512231Z if compiled: 2025-05-07T20:33:35.4512487Z op = torch.compile(op) 2025-05-07T20:33:35.4512785Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:35.4513049Z 2025-05-07T20:33:35.4513245Z > y_fp8, y_scale = fn() 2025-05-07T20:33:35.4513410Z 2025-05-07T20:33:35.4513518Z moe/activation_test.py:117: 2025-05-07T20:33:35.4513813Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:35.4514295Z moe/activation_test.py:115: in fn 2025-05-07T20:33:35.4514578Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:35.4515146Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:35.4515718Z return fn(*args, **kwargs) 
2025-05-07T20:33:35.4516383Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:35.4517058Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:35.4517604Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:35.4518281Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:35.4518992Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:35.4519515Z kernel = self.compile( 2025-05-07T20:33:35.4520072Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:35.4520720Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:35.4521108Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:35.4521342Z 2025-05-07T20:33:35.4521547Z self = 2025-05-07T20:33:35.4522620Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:35.4524007Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8819ea05e0>} 2025-05-07T20:33:35.4525337Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:35.4526392Z context = 2025-05-07T20:33:35.4526684Z 2025-05-07T20:33:35.4526849Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:35.4527369Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:35.4527834Z module_map=module_map) 2025-05-07T20:33:35.4528197Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:35.4528548Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:35.4528810Z E ^ 2025-05-07T20:33:35.4529262Z E ValueError("type fp8e4nv not supported in this architecture. 
Hypothesis went on to try ten more examples, and every one failed with the identical ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"), raised through the same traceback from triton/compiler/compiler.py:100:
2025-05-07T20:33:35.4530818Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=None, contiguous=False, compiled=True)
2025-05-07T20:33:35.6155511Z Trying example: test_silu_mul_quant(self=, T=2048, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:35.6195342Z Trying example: test_silu_mul_quant(self=, T=4096, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:35.7904253Z Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:35.7939383Z Trying example: test_silu_mul_quant(self=, T=16384, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
2025-05-07T20:33:35.9134243Z Trying example: test_silu_mul_quant(self=, T=16384, D=5120, scale_ub=1200.0, contiguous=True, compiled=False)
2025-05-07T20:33:35.9166017Z Trying example: test_silu_mul_quant(self=, T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
2025-05-07T20:33:35.9208212Z Trying example: test_silu_mul_quant(self=, T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:36.0849228Z Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
2025-05-07T20:33:36.0882972Z Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=True, compiled=True)
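All of these failures are environmental rather than numerical: Triton refuses to even build _fbgemm_silu_mul_quant because the kernel requests the fp8e4nv dtype (the Triton name corresponding to torch.float8_e4m3fn), and the GPU in this job only exposes fp8e4b15 and fp8e5. A minimal sketch of a capability guard that would skip the test on such devices follows; the sm_89 threshold is an assumption based on NVIDIA's Ada-and-newer fp8 support, so adjust it to the backend's actual support matrix:

import unittest
import torch

def gpu_supports_fp8e4nv() -> bool:
    # fp8e4nv maps to torch.float8_e4m3fn; Triton's NVIDIA backend compiles
    # it only on sufficiently new architectures (assumed here: compute
    # capability 8.9, i.e. Ada, and newer).
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

# Applied to the failing test, stacking with @given/@settings:
#
# @unittest.skipUnless(gpu_supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
# def test_silu_mul_quant(self, ...) -> None:
#     ...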
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:36.2129049Z 2025-05-07T20:33:36.2129523Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:36.2130028Z 2025-05-07T20:33:36.2130137Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:36.2130538Z self=, 2025-05-07T20:33:36.2130936Z T=16384, 2025-05-07T20:33:36.2131129Z D=5120, 2025-05-07T20:33:36.2131313Z scale_ub=None, 2025-05-07T20:33:36.2131532Z contiguous=False, 2025-05-07T20:33:36.2131757Z compiled=False, 2025-05-07T20:33:36.2131948Z ) 2025-05-07T20:33:36.2132272Z self = 2025-05-07T20:33:36.2132764Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:36.2133036Z 2025-05-07T20:33:36.2133169Z @given( 2025-05-07T20:33:36.2133393Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:36.2133702Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:36.2134010Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:36.2134335Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:36.2134665Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:36.2134945Z ) 2025-05-07T20:33:36.2135281Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:36.2135735Z def test_silu_mul_quant( 2025-05-07T20:33:36.2135976Z self, 2025-05-07T20:33:36.2136175Z T: int, 2025-05-07T20:33:36.2136377Z D: int, 2025-05-07T20:33:36.2136594Z scale_ub: Optional[float], 2025-05-07T20:33:36.2136874Z contiguous: bool, 2025-05-07T20:33:36.2137106Z compiled: bool, 2025-05-07T20:33:36.2137335Z ) -> None: 2025-05-07T20:33:36.2137552Z torch.manual_seed(2025) 2025-05-07T20:33:36.2137795Z 2025-05-07T20:33:36.2138065Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:36.2138404Z 2025-05-07T20:33:36.2138596Z x_sign = torch.sign(x) 2025-05-07T20:33:36.2138897Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:36.2141419Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:36.2143316Z 2025-05-07T20:33:36.2143438Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:36.2143656Z 2025-05-07T20:33:36.2143771Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:36.2144268Z self=, 2025-05-07T20:33:36.2144685Z T=4096, 2025-05-07T20:33:36.2144883Z D=7168, 2025-05-07T20:33:36.2145075Z scale_ub=1200.0, 2025-05-07T20:33:36.2145314Z contiguous=True, 2025-05-07T20:33:36.2145542Z compiled=True, 2025-05-07T20:33:36.2145744Z ) 2025-05-07T20:33:36.2146058Z self = 2025-05-07T20:33:36.2146614Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:36.2146887Z 2025-05-07T20:33:36.2146971Z @given( 2025-05-07T20:33:36.2147187Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:36.2147572Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:36.2147883Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:36.2148198Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:36.2148522Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:36.2148804Z ) 2025-05-07T20:33:36.2149144Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:36.2149653Z def test_silu_mul_quant( 2025-05-07T20:33:36.2149902Z self, 2025-05-07T20:33:36.2150085Z T: int, 2025-05-07T20:33:36.2150277Z D: int, 2025-05-07T20:33:36.2150491Z scale_ub: Optional[float], 2025-05-07T20:33:36.2150755Z contiguous: bool, 2025-05-07T20:33:36.2150993Z compiled: bool, 2025-05-07T20:33:36.2151217Z ) -> None: 2025-05-07T20:33:36.2151425Z torch.manual_seed(2025) 2025-05-07T20:33:36.2151655Z 2025-05-07T20:33:36.2151917Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:36.2152257Z 2025-05-07T20:33:36.2152436Z x_sign = torch.sign(x) 2025-05-07T20:33:36.2152720Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:36.2155209Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
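
The OutOfMemoryError text itself suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, but note the numbers it reports: only tens of MiB are reserved-but-unallocated while roughly 21.6 GiB is live, so this looks like genuine exhaustion rather than fragmentation, and the allocator hint may not help much here. If it is tried anyway, it must be in the environment before the first CUDA allocation; a sketch:

    import os

    # Must be set before the process touches CUDA; in CI the cleanest way
    # is exporting it in the job step rather than in Python:
    #   PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python -m pytest ...
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch  # imported only after the variable is set
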
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:36.2157081Z 2025-05-07T20:33:36.2157208Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:36.2157414Z 2025-05-07T20:33:36.2157518Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:36.2157920Z self=, 2025-05-07T20:33:36.2158343Z T=16384, 2025-05-07T20:33:36.2158535Z D=7168, 2025-05-07T20:33:36.2158720Z scale_ub=None, 2025-05-07T20:33:36.2158934Z contiguous=False, 2025-05-07T20:33:36.2159161Z compiled=False, 2025-05-07T20:33:36.2159359Z ) 2025-05-07T20:33:36.2159679Z self = 2025-05-07T20:33:36.2160174Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:36.2160443Z 2025-05-07T20:33:36.2168593Z @given( 2025-05-07T20:33:36.2168860Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:36.2169177Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:36.2169472Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:36.2169792Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:36.2170110Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:36.2170387Z ) 2025-05-07T20:33:36.2170727Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:36.2171165Z def test_silu_mul_quant( 2025-05-07T20:33:36.2171399Z self, 2025-05-07T20:33:36.2171584Z T: int, 2025-05-07T20:33:36.2171778Z D: int, 2025-05-07T20:33:36.2172072Z scale_ub: Optional[float], 2025-05-07T20:33:36.2172338Z contiguous: bool, 2025-05-07T20:33:36.2172580Z compiled: bool, 2025-05-07T20:33:36.2172798Z ) -> None: 2025-05-07T20:33:36.2173004Z torch.manual_seed(2025) 2025-05-07T20:33:36.2173240Z 2025-05-07T20:33:36.2173504Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:36.2175632Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:36.2177506Z 2025-05-07T20:33:36.2177627Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:36.3409572Z 2025-05-07T20:33:36.3409990Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:36.3410590Z self=, 2025-05-07T20:33:36.3411096Z T=2048, 2025-05-07T20:33:36.3411289Z D=7168, 2025-05-07T20:33:36.3411485Z scale_ub=1200.0, 2025-05-07T20:33:36.3411705Z contiguous=True, 2025-05-07T20:33:36.3411923Z compiled=True, 2025-05-07T20:33:36.3412124Z ) 2025-05-07T20:33:36.3412446Z self = 2025-05-07T20:33:36.3412936Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:36.3413200Z 2025-05-07T20:33:36.3413275Z @given( 2025-05-07T20:33:36.3413624Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:36.3413933Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:36.3414247Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:36.3414571Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:36.3414897Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:36.3415185Z ) 2025-05-07T20:33:36.3415523Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:36.3415968Z def test_silu_mul_quant( 2025-05-07T20:33:36.3416214Z self, 2025-05-07T20:33:36.3416402Z T: int, 2025-05-07T20:33:36.3416598Z D: int, 2025-05-07T20:33:36.3416816Z scale_ub: Optional[float], 2025-05-07T20:33:36.3417081Z contiguous: bool, 2025-05-07T20:33:36.3417319Z compiled: bool, 2025-05-07T20:33:36.3417548Z ) -> None: 2025-05-07T20:33:36.3417755Z torch.manual_seed(2025) 2025-05-07T20:33:36.3418000Z 2025-05-07T20:33:36.3418278Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:36.3418623Z 2025-05-07T20:33:36.3418821Z x_sign = torch.sign(x) 2025-05-07T20:33:36.3419121Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:36.3421263Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
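
The allocation sizes in these failures follow directly from the test's input shape: x = torch.randn([T, 2 * D], dtype=torch.bfloat16) needs T * 2D * 2 bytes, and the same-sized temporaries from torch.abs/torch.clamp fail identically. Checking the requests reported above:

    # bfloat16 = 2 bytes/element; x has T * (2 * D) elements.
    def x_bytes(T: int, D: int) -> int:
        return T * (2 * D) * 2

    assert x_bytes(16384, 7168) == 448 * 2**20  # "Tried to allocate 448.00 MiB"
    assert x_bytes(16384, 5120) == 320 * 2**20  # "Tried to allocate 320.00 MiB"
    assert x_bytes(4096, 7168) == 112 * 2**20   # "Tried to allocate 112.00 MiB"
    assert x_bytes(2048, 7168) == 56 * 2**20    # "Tried to allocate 56.00 MiB"
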
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:36.3423205Z 2025-05-07T20:33:36.3423327Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:36.3423540Z 2025-05-07T20:33:36.3423645Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:36.3424071Z self=, 2025-05-07T20:33:36.3424581Z T=2048, 2025-05-07T20:33:36.3424775Z D=7168, 2025-05-07T20:33:36.3424961Z scale_ub=None, 2025-05-07T20:33:36.3425173Z contiguous=True, 2025-05-07T20:33:36.3425395Z compiled=False, 2025-05-07T20:33:36.3425592Z ) 2025-05-07T20:33:36.3425911Z self = 2025-05-07T20:33:36.3426503Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:36.3426781Z 2025-05-07T20:33:36.3426858Z @given( 2025-05-07T20:33:36.3427084Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:36.3427483Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:36.3427781Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:36.3428110Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:36.3428439Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:36.3428724Z ) 2025-05-07T20:33:36.3429067Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:36.3429549Z def test_silu_mul_quant( 2025-05-07T20:33:36.3429792Z self, 2025-05-07T20:33:36.3430018Z T: int, 2025-05-07T20:33:36.3430213Z D: int, 2025-05-07T20:33:36.3430420Z scale_ub: Optional[float], 2025-05-07T20:33:36.3430689Z contiguous: bool, 2025-05-07T20:33:36.3430929Z compiled: bool, 2025-05-07T20:33:36.3431143Z ) -> None: 2025-05-07T20:33:36.3431360Z torch.manual_seed(2025) 2025-05-07T20:33:36.3431599Z 2025-05-07T20:33:36.3431861Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:36.3432211Z 2025-05-07T20:33:36.3432400Z > x_sign = torch.sign(x) 2025-05-07T20:33:36.3434400Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:36.3436403Z 2025-05-07T20:33:36.3436521Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:36.3436732Z 2025-05-07T20:33:36.3436834Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:36.3437241Z self=, 2025-05-07T20:33:36.3437634Z T=1, 2025-05-07T20:33:36.3437811Z D=7168, 2025-05-07T20:33:36.3437997Z scale_ub=1200.0, 2025-05-07T20:33:36.3438215Z contiguous=True, 2025-05-07T20:33:36.3438430Z compiled=False, 2025-05-07T20:33:36.3438634Z ) 2025-05-07T20:33:36.3438946Z self = 2025-05-07T20:33:36.3439422Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:36.3439688Z 2025-05-07T20:33:36.3439768Z @given( 2025-05-07T20:33:36.3439997Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:36.3440595Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:36.3440899Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:36.3441229Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:36.3441551Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:36.3441832Z ) 2025-05-07T20:33:36.3442175Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:36.3442627Z def test_silu_mul_quant( 2025-05-07T20:33:36.3442869Z self, 2025-05-07T20:33:36.3443064Z T: int, 2025-05-07T20:33:36.3443266Z D: int, 2025-05-07T20:33:36.3443480Z scale_ub: Optional[float], 2025-05-07T20:33:36.3443755Z contiguous: bool, 2025-05-07T20:33:36.3444074Z compiled: bool, 2025-05-07T20:33:36.3444293Z ) -> None: 2025-05-07T20:33:36.3444519Z torch.manual_seed(2025) 2025-05-07T20:33:36.3444757Z 2025-05-07T20:33:36.3445019Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:36.3445362Z 2025-05-07T20:33:36.3445556Z x_sign = torch.sign(x) 2025-05-07T20:33:36.3445915Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:36.3446213Z x = x_sign * x_clamp 2025-05-07T20:33:36.3446460Z x0 = x[:, :D] 2025-05-07T20:33:36.3446671Z x1 = x[:, D:] 2025-05-07T20:33:36.3446871Z 2025-05-07T20:33:36.3447053Z if contiguous: 2025-05-07T20:33:36.3447282Z x0 = x0.contiguous() 2025-05-07T20:33:36.3447531Z x1 = x1.contiguous() 2025-05-07T20:33:36.3447771Z 2025-05-07T20:33:36.3447957Z if scale_ub is not None: 2025-05-07T20:33:36.3448225Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:36.3448559Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:36.3448869Z ) 2025-05-07T20:33:36.3449124Z else: 2025-05-07T20:33:36.3449335Z scale_ub_tensor = None 2025-05-07T20:33:36.3449588Z 2025-05-07T20:33:36.3449810Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:36.3450141Z op = silu_mul_quant 2025-05-07T20:33:36.3450440Z if compiled: 2025-05-07T20:33:36.3450686Z op = torch.compile(op) 2025-05-07T20:33:36.3450982Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:36.3451257Z 2025-05-07T20:33:36.3451453Z > y_fp8, y_scale = fn() 2025-05-07T20:33:36.3451614Z 2025-05-07T20:33:36.3451712Z moe/activation_test.py:117: 2025-05-07T20:33:36.3452005Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:36.3452399Z moe/activation_test.py:115: in fn 2025-05-07T20:33:36.3452680Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:36.3453379Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:36.3454068Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:36.3454611Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:36.3455292Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:36.3455954Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:36.3456490Z kernel = self.compile( 2025-05-07T20:33:36.3457053Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:36.3457729Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:36.3458138Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:36.3458365Z 2025-05-07T20:33:36.3458580Z self = 2025-05-07T20:33:36.3459676Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:36.3461039Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f88199aa2a0>} 2025-05-07T20:33:36.3462365Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:36.3463383Z context = 2025-05-07T20:33:36.3463674Z 2025-05-07T20:33:36.3463894Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:36.3464418Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:36.3464888Z module_map=module_map) 2025-05-07T20:33:36.3465259Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:36.3465604Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:36.3465918Z E ^ 2025-05-07T20:33:36.3466379Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:36.3466831Z 2025-05-07T20:33:36.3467262Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:36.3467837Z 2025-05-07T20:33:36.3467939Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:36.3468345Z self=, 2025-05-07T20:33:36.3468738Z T=128, 2025-05-07T20:33:36.3468921Z D=5120, 2025-05-07T20:33:36.3469110Z scale_ub=None, 2025-05-07T20:33:36.3469395Z contiguous=True, 2025-05-07T20:33:36.3469614Z compiled=False, 2025-05-07T20:33:36.3469814Z ) 2025-05-07T20:33:36.3470133Z self = 2025-05-07T20:33:36.3470614Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:36.3470887Z 2025-05-07T20:33:36.3470963Z @given( 2025-05-07T20:33:36.3471210Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:36.3471516Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:36.3471817Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:36.3472141Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:36.3472470Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:36.3472793Z ) 2025-05-07T20:33:36.3473147Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:36.3473608Z def test_silu_mul_quant( 2025-05-07T20:33:36.3473846Z self, 2025-05-07T20:33:36.3474037Z T: int, 2025-05-07T20:33:36.3474234Z D: int, 2025-05-07T20:33:36.3474446Z scale_ub: Optional[float], 2025-05-07T20:33:36.3474724Z contiguous: bool, 2025-05-07T20:33:36.3474964Z compiled: bool, 2025-05-07T20:33:36.3475179Z ) -> None: 2025-05-07T20:33:36.3475393Z torch.manual_seed(2025) 2025-05-07T20:33:36.3475635Z 2025-05-07T20:33:36.3475896Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:36.3476231Z 2025-05-07T20:33:36.3476431Z x_sign = torch.sign(x) 2025-05-07T20:33:36.3476716Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:36.3477013Z x = x_sign * x_clamp 2025-05-07T20:33:36.3477256Z x0 = x[:, :D] 2025-05-07T20:33:36.3477471Z x1 = x[:, D:] 2025-05-07T20:33:36.3477671Z 2025-05-07T20:33:36.3477856Z if contiguous: 2025-05-07T20:33:36.3478084Z x0 = x0.contiguous() 2025-05-07T20:33:36.3478331Z x1 = x1.contiguous() 2025-05-07T20:33:36.3478563Z 2025-05-07T20:33:36.3478749Z if scale_ub is not None: 2025-05-07T20:33:36.3479014Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:36.3479346Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:36.3479655Z ) 2025-05-07T20:33:36.3479841Z else: 2025-05-07T20:33:36.3480050Z scale_ub_tensor = None 2025-05-07T20:33:36.3480298Z 2025-05-07T20:33:36.3480518Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:36.3480824Z op = silu_mul_quant 2025-05-07T20:33:36.3481075Z if compiled: 2025-05-07T20:33:36.3481321Z op = torch.compile(op) 2025-05-07T20:33:36.3481613Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:36.3481879Z 2025-05-07T20:33:36.3482073Z > y_fp8, y_scale = fn() 2025-05-07T20:33:36.3482283Z 2025-05-07T20:33:36.3482382Z moe/activation_test.py:117: 2025-05-07T20:33:36.3482676Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:36.3483003Z moe/activation_test.py:115: in fn 2025-05-07T20:33:36.3483274Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:36.3483958Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:36.3484686Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:36.3485235Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:36.3485903Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:36.3486563Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:36.3487087Z kernel = self.compile( 2025-05-07T20:33:36.3487671Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:36.3488319Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:36.3488716Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:36.3488943Z 2025-05-07T20:33:36.3489155Z self = 2025-05-07T20:33:36.3490311Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:36.3491665Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f88199ab1a0>} 2025-05-07T20:33:36.3493037Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:36.3494044Z context = 2025-05-07T20:33:36.3494324Z 2025-05-07T20:33:36.3494494Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:36.3495012Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:36.3495491Z module_map=module_map) 2025-05-07T20:33:36.3495855Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:36.3496200Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:36.3496457Z E ^ 2025-05-07T20:33:36.3496923Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:36.3497375Z 2025-05-07T20:33:36.3497799Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:36.4632412Z 2025-05-07T20:33:36.4633018Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:36.4633628Z self=, 2025-05-07T20:33:36.4634168Z T=128, 2025-05-07T20:33:36.4634382Z D=7168, 2025-05-07T20:33:36.4634580Z scale_ub=None, 2025-05-07T20:33:36.4634784Z contiguous=True, 2025-05-07T20:33:36.4635003Z compiled=False, 2025-05-07T20:33:36.4635206Z ) 2025-05-07T20:33:36.4635522Z self = 2025-05-07T20:33:36.4636002Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:36.4636272Z 2025-05-07T20:33:36.4636349Z @given( 2025-05-07T20:33:36.4636580Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:36.4636878Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:36.4637486Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:36.4637814Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:36.4638131Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:36.4638406Z ) 2025-05-07T20:33:36.4638745Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:36.4639315Z def test_silu_mul_quant( 2025-05-07T20:33:36.4639553Z self, 2025-05-07T20:33:36.4639734Z T: int, 2025-05-07T20:33:36.4639930Z D: int, 2025-05-07T20:33:36.4640409Z scale_ub: Optional[float], 2025-05-07T20:33:36.4640795Z contiguous: bool, 2025-05-07T20:33:36.4641033Z compiled: bool, 2025-05-07T20:33:36.4641262Z ) -> None: 2025-05-07T20:33:36.4641471Z torch.manual_seed(2025) 2025-05-07T20:33:36.4641713Z 2025-05-07T20:33:36.4641987Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:36.4642319Z 2025-05-07T20:33:36.4642516Z x_sign = torch.sign(x) 2025-05-07T20:33:36.4642905Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:36.4643212Z x = x_sign * x_clamp 2025-05-07T20:33:36.4643449Z x0 = x[:, :D] 2025-05-07T20:33:36.4643671Z x1 = x[:, D:] 2025-05-07T20:33:36.4643873Z 2025-05-07T20:33:36.4644050Z if contiguous: 2025-05-07T20:33:36.4644280Z x0 = x0.contiguous() 2025-05-07T20:33:36.4644534Z x1 = x1.contiguous() 2025-05-07T20:33:36.4644765Z 2025-05-07T20:33:36.4644953Z if scale_ub is not None: 2025-05-07T20:33:36.4645226Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:36.4645548Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:36.4645857Z ) 2025-05-07T20:33:36.4646044Z else: 2025-05-07T20:33:36.4646340Z scale_ub_tensor = None 2025-05-07T20:33:36.4646587Z 2025-05-07T20:33:36.4646811Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:36.4647115Z op = silu_mul_quant 2025-05-07T20:33:36.4647357Z if compiled: 2025-05-07T20:33:36.4647602Z op = torch.compile(op) 2025-05-07T20:33:36.4647890Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:36.4648163Z 2025-05-07T20:33:36.4648353Z > y_fp8, y_scale = fn() 2025-05-07T20:33:36.4648515Z 2025-05-07T20:33:36.4648621Z moe/activation_test.py:117: 2025-05-07T20:33:36.4648909Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:36.4649230Z moe/activation_test.py:115: in fn 2025-05-07T20:33:36.4649508Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:36.4650208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:36.4650888Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:36.4651439Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:36.4652117Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:36.4652762Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:36.4653292Z kernel = self.compile( 2025-05-07T20:33:36.4653852Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:36.4654493Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:36.4654885Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:36.4655119Z 2025-05-07T20:33:36.4655321Z self = 2025-05-07T20:33:36.4656507Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:36.4657883Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8819b78040>} 2025-05-07T20:33:36.4659202Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:36.4660345Z context = 2025-05-07T20:33:36.4660624Z 2025-05-07T20:33:36.4660795Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:36.4661309Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:36.4661766Z module_map=module_map) 2025-05-07T20:33:36.4662133Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:36.4662492Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:36.4662788Z E ^ 2025-05-07T20:33:36.4663254Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:36.4663714Z 2025-05-07T20:33:36.4664150Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:36.4664651Z 2025-05-07T20:33:36.4664758Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:36.4665156Z self=, 2025-05-07T20:33:36.4665553Z T=2048, 2025-05-07T20:33:36.4665740Z D=7168, 2025-05-07T20:33:36.4665922Z scale_ub=1200.0, 2025-05-07T20:33:36.4666142Z contiguous=True, 2025-05-07T20:33:36.4666410Z compiled=False, 2025-05-07T20:33:36.4666610Z ) 2025-05-07T20:33:36.4666929Z self = 2025-05-07T20:33:36.4667500Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:36.4667772Z 2025-05-07T20:33:36.4667853Z @given( 2025-05-07T20:33:36.4668069Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:36.4668374Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:36.4668675Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:36.4668996Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:36.4669321Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:36.4669601Z ) 2025-05-07T20:33:36.4669945Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:36.4670373Z def test_silu_mul_quant( 2025-05-07T20:33:36.4670612Z self, 2025-05-07T20:33:36.4670795Z T: int, 2025-05-07T20:33:36.4670996Z D: int, 2025-05-07T20:33:36.4671208Z scale_ub: Optional[float], 2025-05-07T20:33:36.4671475Z contiguous: bool, 2025-05-07T20:33:36.4671709Z compiled: bool, 2025-05-07T20:33:36.4671932Z ) -> None: 2025-05-07T20:33:36.4672146Z torch.manual_seed(2025) 2025-05-07T20:33:36.4672377Z 2025-05-07T20:33:36.4672640Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:36.4674664Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
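
Each "Trying example:" block is Hypothesis replaying the property test with one draw from the sampled_from grids, and the op under test is optionally wrapped in torch.compile inside fn(). That is why the Triton CompilationError only surfaces at the y_fp8, y_scale = fn() call: Triton JIT-compiles a kernel on its first launch, whether reached eagerly or through torch.compile. The skeleton of that pattern, reduced to its moving parts (the relu stand-in is illustrative only):

    from hypothesis import Verbosity, given, settings, strategies as st
    import torch

    @given(
        T=st.sampled_from([1, 128]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=4, deadline=None)
    def test_pattern(T: int, compiled: bool) -> None:
        x = torch.randn(T, 8)
        op = torch.nn.functional.relu  # stand-in for silu_mul_quant
        if compiled:
            op = torch.compile(op)     # compiled lazily, at the first call
        assert op(x).shape == x.shape
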
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:36.4676608Z 2025-05-07T20:33:36.4676727Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:36.4676933Z 2025-05-07T20:33:36.4677082Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:36.4677493Z self=, 2025-05-07T20:33:36.4677894Z T=1, 2025-05-07T20:33:36.4678074Z D=5120, 2025-05-07T20:33:36.4678254Z scale_ub=1200.0, 2025-05-07T20:33:36.4678473Z contiguous=True, 2025-05-07T20:33:36.4678704Z compiled=False, 2025-05-07T20:33:36.4678949Z ) 2025-05-07T20:33:36.4679269Z self = 2025-05-07T20:33:36.4679748Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:36.4680006Z 2025-05-07T20:33:36.4680084Z @given( 2025-05-07T20:33:36.4680310Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:36.4680614Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:36.4680911Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:36.4689707Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:36.4690059Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:36.4690460Z ) 2025-05-07T20:33:36.4690807Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:36.4691261Z def test_silu_mul_quant( 2025-05-07T20:33:36.4691505Z self, 2025-05-07T20:33:36.4691699Z T: int, 2025-05-07T20:33:36.4691902Z D: int, 2025-05-07T20:33:36.4692118Z scale_ub: Optional[float], 2025-05-07T20:33:36.4692381Z contiguous: bool, 2025-05-07T20:33:36.4692631Z compiled: bool, 2025-05-07T20:33:36.4692857Z ) -> None: 2025-05-07T20:33:36.4693074Z torch.manual_seed(2025) 2025-05-07T20:33:36.4693309Z 2025-05-07T20:33:36.4693580Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:36.4693973Z 2025-05-07T20:33:36.4694160Z x_sign = torch.sign(x) 2025-05-07T20:33:36.4694452Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:36.4694757Z x = x_sign * x_clamp 2025-05-07T20:33:36.4694986Z x0 = x[:, :D] 2025-05-07T20:33:36.4695193Z x1 = x[:, D:] 2025-05-07T20:33:36.4695400Z 2025-05-07T20:33:36.4695574Z if contiguous: 2025-05-07T20:33:36.4695804Z x0 = x0.contiguous() 2025-05-07T20:33:36.4696060Z x1 = x1.contiguous() 2025-05-07T20:33:36.4696293Z 2025-05-07T20:33:36.4696481Z if scale_ub is not None: 2025-05-07T20:33:36.4696753Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:36.4697076Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:36.4697379Z ) 2025-05-07T20:33:36.4697569Z else: 2025-05-07T20:33:36.4697777Z scale_ub_tensor = None 2025-05-07T20:33:36.4698022Z 2025-05-07T20:33:36.4698249Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:36.4698563Z op = silu_mul_quant 2025-05-07T20:33:36.4698801Z if compiled: 2025-05-07T20:33:36.4699053Z op = torch.compile(op) 2025-05-07T20:33:36.4699351Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:36.4699615Z 2025-05-07T20:33:36.4699800Z > y_fp8, y_scale = fn() 2025-05-07T20:33:36.4699964Z 2025-05-07T20:33:36.4700068Z moe/activation_test.py:117: 2025-05-07T20:33:36.4700360Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:36.4700690Z moe/activation_test.py:115: in fn 2025-05-07T20:33:36.4700974Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:36.4701677Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:36.4702356Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:36.4702917Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:36.4703592Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:36.4704303Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:36.4704837Z kernel = self.compile( 2025-05-07T20:33:36.4705387Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:36.4706060Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:36.4706500Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:36.4706729Z 2025-05-07T20:33:36.4706934Z self = 2025-05-07T20:33:36.4708067Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:36.4709478Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8819b79580>} 2025-05-07T20:33:36.4710801Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:36.4711816Z context = 2025-05-07T20:33:36.4712108Z 2025-05-07T20:33:36.4712272Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:36.4712794Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:36.4713262Z module_map=module_map) 2025-05-07T20:33:36.4713671Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:36.4714029Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:36.4714288Z E ^ 2025-05-07T20:33:36.4714762Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:36.4715218Z 2025-05-07T20:33:36.4715638Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:36.5563093Z 2025-05-07T20:33:36.5563394Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:36.5564030Z self=, 2025-05-07T20:33:36.5564431Z T=2048, 2025-05-07T20:33:36.5564623Z D=5120, 2025-05-07T20:33:36.5564814Z scale_ub=None, 2025-05-07T20:33:36.5565025Z contiguous=True, 2025-05-07T20:33:36.5565251Z compiled=False, 2025-05-07T20:33:36.5565465Z ) 2025-05-07T20:33:36.5565792Z self = 2025-05-07T20:33:36.5566293Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:36.5566574Z 2025-05-07T20:33:36.5566657Z @given( 2025-05-07T20:33:36.5566892Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:36.5567197Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:36.5567535Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:36.5567867Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:36.5568195Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:36.5568477Z ) 2025-05-07T20:33:36.5568818Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:36.5569278Z def test_silu_mul_quant( 2025-05-07T20:33:36.5569519Z self, 2025-05-07T20:33:36.5569709Z T: int, 2025-05-07T20:33:36.5569908Z D: int, 2025-05-07T20:33:36.5570130Z scale_ub: Optional[float], 2025-05-07T20:33:36.5570399Z contiguous: bool, 2025-05-07T20:33:36.5570640Z compiled: bool, 2025-05-07T20:33:36.5570864Z ) -> None: 2025-05-07T20:33:36.5571297Z torch.manual_seed(2025) 2025-05-07T20:33:36.5571540Z 2025-05-07T20:33:36.5571811Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:36.5572150Z 2025-05-07T20:33:36.5572339Z > x_sign = torch.sign(x) 2025-05-07T20:33:36.5574268Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:36.5576283Z 2025-05-07T20:33:36.5576399Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:36.5576608Z 2025-05-07T20:33:36.5576721Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:36.5577203Z self=, 2025-05-07T20:33:36.5577619Z T=16384, 2025-05-07T20:33:36.5577815Z D=5120, 2025-05-07T20:33:36.5577995Z scale_ub=None, 2025-05-07T20:33:36.5578210Z contiguous=True, 2025-05-07T20:33:36.5578433Z compiled=False, 2025-05-07T20:33:36.5578632Z ) 2025-05-07T20:33:36.5578949Z self = 2025-05-07T20:33:36.5579455Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:36.5579729Z 2025-05-07T20:33:36.5579813Z @given( 2025-05-07T20:33:36.5580034Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:36.5580354Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:36.5580748Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:36.5581066Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:36.5581393Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:36.5581675Z ) 2025-05-07T20:33:36.5582016Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:36.5582462Z def test_silu_mul_quant( 2025-05-07T20:33:36.5582704Z self, 2025-05-07T20:33:36.5582890Z T: int, 2025-05-07T20:33:36.5583087Z D: int, 2025-05-07T20:33:36.5583311Z scale_ub: Optional[float], 2025-05-07T20:33:36.5583580Z contiguous: bool, 2025-05-07T20:33:36.5583815Z compiled: bool, 2025-05-07T20:33:36.5584037Z ) -> None: 2025-05-07T20:33:36.5584252Z torch.manual_seed(2025) 2025-05-07T20:33:36.5584489Z 2025-05-07T20:33:36.5584757Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:36.5586822Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:36.5588837Z 2025-05-07T20:33:36.5588962Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:36.5589170Z 2025-05-07T20:33:36.5589276Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:36.5589686Z self=, 2025-05-07T20:33:36.5590081Z T=4096, 2025-05-07T20:33:36.5590271Z D=5120, 2025-05-07T20:33:36.5590453Z scale_ub=None, 2025-05-07T20:33:36.5590671Z contiguous=True, 2025-05-07T20:33:36.5590900Z compiled=False, 2025-05-07T20:33:36.5591096Z ) 2025-05-07T20:33:36.5591461Z self = 2025-05-07T20:33:36.5591949Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:36.5592212Z 2025-05-07T20:33:36.5592291Z @given( 2025-05-07T20:33:36.5592514Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:36.5592818Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:36.5593180Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:36.5593509Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:36.5593834Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:36.5594116Z ) 2025-05-07T20:33:36.5594460Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:36.5594909Z def test_silu_mul_quant( 2025-05-07T20:33:36.5595151Z self, 2025-05-07T20:33:36.5595338Z T: int, 2025-05-07T20:33:36.5595534Z D: int, 2025-05-07T20:33:36.5595763Z scale_ub: Optional[float], 2025-05-07T20:33:36.5596028Z contiguous: bool, 2025-05-07T20:33:36.5596270Z compiled: bool, 2025-05-07T20:33:36.5596541Z ) -> None: 2025-05-07T20:33:36.5596754Z torch.manual_seed(2025) 2025-05-07T20:33:36.5597000Z 2025-05-07T20:33:36.5597275Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:36.5599322Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:36.5601326Z 2025-05-07T20:33:36.5601440Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:36.5601657Z 2025-05-07T20:33:36.5601761Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:36.5602185Z self=, 2025-05-07T20:33:36.5602578Z T=2048, 2025-05-07T20:33:36.5602761Z D=5120, 2025-05-07T20:33:36.5602954Z scale_ub=None, 2025-05-07T20:33:36.5603170Z contiguous=False, 2025-05-07T20:33:36.5603396Z compiled=False, 2025-05-07T20:33:36.5603611Z ) 2025-05-07T20:33:36.5603938Z self = 2025-05-07T20:33:36.5604435Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:36.5604716Z 2025-05-07T20:33:36.5604796Z @given( 2025-05-07T20:33:36.5605023Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:36.5605339Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:36.5605635Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:36.5605962Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:36.5606291Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:36.5606565Z ) 2025-05-07T20:33:36.5606910Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:36.5607349Z def test_silu_mul_quant( 2025-05-07T20:33:36.5607587Z self, 2025-05-07T20:33:36.5607780Z T: int, 2025-05-07T20:33:36.5607986Z D: int, 2025-05-07T20:33:36.5608198Z scale_ub: Optional[float], 2025-05-07T20:33:36.5608473Z contiguous: bool, 2025-05-07T20:33:36.5608712Z compiled: bool, 2025-05-07T20:33:36.5608931Z ) -> None: 2025-05-07T20:33:36.5609149Z torch.manual_seed(2025) 2025-05-07T20:33:36.5609387Z 2025-05-07T20:33:36.5609646Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:36.5611877Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
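
Reading the OOM reports in sequence, PyTorch's live allocation hovers around 21.5-21.6 GiB in the earlier examples and reaches 21.73 GiB by these later ones, so each Hypothesis draw starts with less headroom than the last; references to earlier examples' tensors are apparently still alive. One mitigation, assuming the growth comes from such lingering references and cached blocks, is to release memory between examples:

    import gc
    import torch

    def release_cuda() -> None:
        # Drop dead Python references first, then return cached, unused
        # blocks to the driver so the next example allocates from a cleaner
        # pool. This cannot reclaim tensors that are still referenced.
        gc.collect()
        torch.cuda.empty_cache()

    # e.g. call release_cuda() from the TestCase's setUp()/tearDown().
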
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:36.5613866Z 2025-05-07T20:33:36.5613980Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:36.5614194Z 2025-05-07T20:33:36.5614293Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:36.5614705Z self=, 2025-05-07T20:33:36.5615109Z T=4096, 2025-05-07T20:33:36.5615298Z D=7168, 2025-05-07T20:33:36.5615485Z scale_ub=None, 2025-05-07T20:33:36.5615693Z contiguous=True, 2025-05-07T20:33:36.5615922Z compiled=True, 2025-05-07T20:33:36.5616126Z ) 2025-05-07T20:33:36.5616449Z self = 2025-05-07T20:33:36.5616985Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:36.5617263Z 2025-05-07T20:33:36.5617345Z @given( 2025-05-07T20:33:36.5617572Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:36.5617882Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:36.5618200Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:36.5618531Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:36.5618861Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:36.5619150Z ) 2025-05-07T20:33:36.5619500Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:36.5619978Z def test_silu_mul_quant( 2025-05-07T20:33:36.5620220Z self, 2025-05-07T20:33:36.5620413Z T: int, 2025-05-07T20:33:36.5620619Z D: int, 2025-05-07T20:33:36.5620840Z scale_ub: Optional[float], 2025-05-07T20:33:36.5621114Z contiguous: bool, 2025-05-07T20:33:36.5621363Z compiled: bool, 2025-05-07T20:33:36.5621584Z ) -> None: 2025-05-07T20:33:36.5621807Z torch.manual_seed(2025) 2025-05-07T20:33:36.5622051Z 2025-05-07T20:33:36.5622316Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:36.5624349Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:36.5626222Z 2025-05-07T20:33:36.5626341Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:36.5626555Z 2025-05-07T20:33:36.5626658Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:36.5627066Z self=, 2025-05-07T20:33:36.5627529Z T=2048, 2025-05-07T20:33:36.5627715Z D=5120, 2025-05-07T20:33:36.5627905Z scale_ub=1200.0, 2025-05-07T20:33:36.5628120Z contiguous=False, 2025-05-07T20:33:36.5628348Z compiled=False, 2025-05-07T20:33:36.6184305Z ) 2025-05-07T20:33:36.6185272Z self = 2025-05-07T20:33:36.6186311Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:36.6186854Z 2025-05-07T20:33:36.6187008Z @given( 2025-05-07T20:33:36.6187575Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:36.6188176Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:36.6189053Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:36.6189697Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:36.6190180Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:36.6190453Z ) 2025-05-07T20:33:36.6190789Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:36.6191221Z def test_silu_mul_quant( 2025-05-07T20:33:36.6191537Z self, 2025-05-07T20:33:36.6191724Z T: int, 2025-05-07T20:33:36.6191912Z D: int, 2025-05-07T20:33:36.6192128Z scale_ub: Optional[float], 2025-05-07T20:33:36.6192394Z contiguous: bool, 2025-05-07T20:33:36.6192635Z compiled: bool, 2025-05-07T20:33:36.6192860Z ) -> None: 2025-05-07T20:33:36.6193070Z torch.manual_seed(2025) 2025-05-07T20:33:36.6193306Z 2025-05-07T20:33:36.6193579Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:36.6195828Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:36.6197793Z 2025-05-07T20:33:36.6197917Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:36.6198124Z 2025-05-07T20:33:36.6198232Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:36.6198634Z self=, 2025-05-07T20:33:36.6199114Z T=4096, 2025-05-07T20:33:36.6199327Z D=7168, 2025-05-07T20:33:36.6199521Z scale_ub=1200.0, 2025-05-07T20:33:36.6199745Z contiguous=True, 2025-05-07T20:33:36.6199960Z compiled=False, 2025-05-07T20:33:36.6200165Z ) 2025-05-07T20:33:36.6200483Z self = 2025-05-07T20:33:36.6200962Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:36.6201231Z 2025-05-07T20:33:36.6201302Z @given( 2025-05-07T20:33:36.6201522Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:36.6201820Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:36.6202119Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:36.6202442Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:36.6202755Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:36.6203035Z ) 2025-05-07T20:33:36.6203390Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:36.6203837Z def test_silu_mul_quant( 2025-05-07T20:33:36.6204068Z self, 2025-05-07T20:33:36.6204263Z T: int, 2025-05-07T20:33:36.6204453Z D: int, 2025-05-07T20:33:36.6204661Z scale_ub: Optional[float], 2025-05-07T20:33:36.6204928Z contiguous: bool, 2025-05-07T20:33:36.6205168Z compiled: bool, 2025-05-07T20:33:36.6205388Z ) -> None: 2025-05-07T20:33:36.6205597Z torch.manual_seed(2025) 2025-05-07T20:33:36.6205830Z 2025-05-07T20:33:36.6206091Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:36.6208225Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:36.6210102Z 2025-05-07T20:33:36.6210220Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:36.6210436Z 2025-05-07T20:33:36.6210538Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:36.6210942Z self=, 2025-05-07T20:33:36.6211372Z T=16384, 2025-05-07T20:33:36.6211558Z D=7168, 2025-05-07T20:33:36.6211746Z scale_ub=None, 2025-05-07T20:33:36.6211948Z contiguous=False, 2025-05-07T20:33:36.6212167Z compiled=True, 2025-05-07T20:33:36.6212364Z ) 2025-05-07T20:33:36.6212672Z self = 2025-05-07T20:33:36.6213166Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:36.6213438Z 2025-05-07T20:33:36.6213519Z @given( 2025-05-07T20:33:36.6213742Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:36.6214056Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:36.6214409Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:36.6214732Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:36.6215045Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:36.6215329Z ) 2025-05-07T20:33:36.6215682Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:36.6216125Z def test_silu_mul_quant( 2025-05-07T20:33:36.6216364Z self, 2025-05-07T20:33:36.6216557Z T: int, 2025-05-07T20:33:36.6216742Z D: int, 2025-05-07T20:33:36.6216964Z scale_ub: Optional[float], 2025-05-07T20:33:36.6217233Z contiguous: bool, 2025-05-07T20:33:36.6217464Z compiled: bool, 2025-05-07T20:33:36.6217681Z ) -> None: 2025-05-07T20:33:36.6217939Z torch.manual_seed(2025) 2025-05-07T20:33:36.6218174Z 2025-05-07T20:33:36.6218432Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:36.6220446Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:36.6222303Z 2025-05-07T20:33:36.6222416Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:36.6222620Z 2025-05-07T20:33:36.6222725Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:36.6223129Z self=, 2025-05-07T20:33:36.6223536Z T=4096, 2025-05-07T20:33:36.6223717Z D=7168, 2025-05-07T20:33:36.6223902Z scale_ub=None, 2025-05-07T20:33:36.6224108Z contiguous=True, 2025-05-07T20:33:36.6224332Z compiled=False, 2025-05-07T20:33:36.6224530Z ) 2025-05-07T20:33:36.6224839Z self = 2025-05-07T20:33:36.6225343Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:36.6225609Z 2025-05-07T20:33:36.6225692Z @given( 2025-05-07T20:33:36.6225909Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:36.6226212Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:36.6226511Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:36.6226821Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:36.6227143Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:36.6227492Z ) 2025-05-07T20:33:36.6227832Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:36.6228280Z def test_silu_mul_quant( 2025-05-07T20:33:36.6228564Z self, 2025-05-07T20:33:36.6228755Z T: int, 2025-05-07T20:33:36.6228940Z D: int, 2025-05-07T20:33:36.6229153Z scale_ub: Optional[float], 2025-05-07T20:33:36.6229435Z contiguous: bool, 2025-05-07T20:33:36.6229665Z compiled: bool, 2025-05-07T20:33:36.6229887Z ) -> None: 2025-05-07T20:33:36.6230165Z torch.manual_seed(2025) 2025-05-07T20:33:36.6230423Z 2025-05-07T20:33:36.6230693Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:36.6232733Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=7168,
    scale_ub=None,
    contiguous=True,
    compiled=False,
)
(test source identical to the listing above)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
E       See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(
    self=,
    T=16384,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=False,
)
(test source identical to the listing above)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. (remainder of the message identical to the one above)

moe/activation_test.py:92: OutOfMemoryError
Trying example: test_silu_mul_quant(
    self=,
    T=128,
    D=5120,
    scale_ub=1200.0,
    contiguous=False,
    compiled=False,
)
self =
T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self =
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ..., debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': ..., 'min_dot_size': ...}
module_map = {'triton.language.extra.libdevice': ...}
context =

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
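This CompilationError, which recurs below for both _fbgemm_silu_mul_quant and _kernel_quantize_fp8_row, is a hardware capability gap rather than a kernel bug: Triton's fp8e4nv corresponds to float8_e4m3fn, which the NVIDIA backend generally accepts only on compute capability 8.9 or newer (Ada/Hopper), and the GPU on this runner evidently predates that. A minimal skip guard, as a sketch (the 8.9 threshold and the decorator placement are assumptions inferred from the error text, not FBGEMM's actual test gating):

    import unittest

    import torch

    def _cuda_supports_fp8e4nv() -> bool:
        # Assumption: Triton's fp8e4nv (float8_e4m3fn) needs SM 8.9+ on NVIDIA
        # GPUs; older parts only expose fp8e4b15 / fp8e5, as the ValueError says.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipIf(not _cuda_supports_fp8e4nv(), "FP8 e4m3 kernels need SM 8.9+")
    class ActivationTests(unittest.TestCase):
        ...  # test_silu_mul_quant would be skipped cleanly on this runner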
Trying example: test_silu_mul_quant(self=, T=2048, D=7168, scale_ub=None, contiguous=False, compiled=False)
(test source identical to the listing above)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
E       See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:92: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=True)
(test source identical to the listing above; with compiled=True the call enters through
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
and then fails in the same Triton compile step:)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=1200.0, contiguous=True, compiled=False)
(test source identical to the listing above)

>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
E       See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=128, D=5120, scale_ub=1200.0, contiguous=True, compiled=True)
(test source identical to the listing above)

>       x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. (4.44 MiB free, 3.87 MiB reserved but unallocated; message otherwise identical to the one above)

moe/activation_test.py:95: OutOfMemoryError

Trying example: test_silu_mul_quant(self=, T=128, D=7168, scale_ub=None, contiguous=True, compiled=True)
(test source identical to the listing above)

>       x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
E       torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. (4.44 MiB free, 3.87 MiB reserved but unallocated; message otherwise identical to the one above)

moe/activation_test.py:92: OutOfMemoryError
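Note the allocator's trajectory across these examples: free memory drops from 26.44 MiB to 4.44 MiB and never recovers, so even 20.00 MiB requests fail once the 448.00 MiB examples have run. Two common mitigations, sketched under the assumption that the cleanup hook is ours to add (neither appears in the test file as shown):

    import gc
    import os

    # 1) The allocator hint the error message itself suggests; it must be set
    #    before the first CUDA allocation, e.g. in the CI job's environment.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

    import torch

    def release_cuda_memory() -> None:
        # 2) Between Hypothesis examples: drop dead tensors still pinned by
        #    traceback frames, then return cached blocks to the driver.
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

Calling release_cuda_memory() from the test's setUp() would keep one oversized example (say T=16384, D=7168) from starving every example generated after it.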
FAILED

=================================== FAILURES ===================================
_____________________ ActivationTests.test_silu_mul_quant ______________________
  + Exception Group Traceback (most recent call last):
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 58, in testPartExecutor
  |     yield
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 651, in run
  |     self._callTestMethod(testMethod)
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/unittest/case.py", line 606, in _callTestMethod
  |     if method() is not None:
  |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 75, in test_silu_mul_quant
  |     T=st.sampled_from([1, 128, 2048, 4096, 16384]),
  |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/hypothesis/core.py", line 1850, in wrapped_test
  |     raise the_error_hypothesis_found
  | ExceptionGroup: Hypothesis found 4 distinct failures. (4 sub-exceptions)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
    | See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    | Falsifying example: test_silu_mul_quant(
    |     self=,
    |     T=2048,
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=False,  # or any other generated value
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEECQQBBAEEAQQE=') as a decorator on your test case
  +---------------- 2 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. (message otherwise identical to sub-exception 1)
    | Falsifying example: test_silu_mul_quant(
    |     self=,
    |     T=128,
    |     D=7168,
    |     scale_ub=None,
    |     contiguous=True,
    |     compiled=True,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQFBAEEAQQA=') as a decorator on your test case
  +---------------- 3 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 92, in test_silu_mul_quant
    |     x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
    | torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. (message otherwise identical to sub-exception 1)
    | Falsifying example: test_silu_mul_quant(
    |     self=,
    |     T=128,
    |     D=5120,
    |     scale_ub=1200.0,
    |     contiguous=True,
    |     compiled=True,
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEBQQBBAUEAQQA=') as a decorator on your test case
  +---------------- 4 ----------------
    | Traceback (most recent call last):
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 126, in test_silu_mul_quant
    |     y_fp8_ref, y_scale_ref = ref_fn()
    |   File "/home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM/fbgemm_gpu/experimental/gen_ai/test/moe/activation_test.py", line 124, in ref_fn
    |     return triton_quantize_fp8_row(y, scale_ub_tensor)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py", line 2370, in triton_quantize_fp8_row
    |     _kernel_quantize_fp8_row[grid](
    |         a,
    |     ...<23 lines>...
    |         USE_INT64=use_int64,
    |     )
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 330, in <lambda>
    |     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 186, in run
    |     timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 166, in _bench
    |     return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py", line 117, in do_bench
    |     fn()
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py", line 152, in kernel_call
    |     self.fn.run(*args, **current)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py", line 623, in run
    |     kernel = self.compile(src, target=target, options=options.__dict__)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 273, in compile
    |     module = src.make_ir(options, codegen_fns, module_map, context)
    |   File "/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py", line 100, in make_ir
    |     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, module_map=module_map)
    | triton.compiler.errors.CompilationError: at 1:0:
    | def _kernel_quantize_fp8_row(
    | ^
    | ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
    | Falsifying example: test_silu_mul_quant(
    |     # The test always failed when commented parts were varied together.
    |     self=,
    |     T=1,  # or any other generated value
    |     D=5120,  # or any other generated value
    |     scale_ub=None,  # or any other generated value
    |     contiguous=True,  # or any other generated value
    |     compiled=True,  # or any other generated value
    | )
    |
    | You can reproduce this example by temporarily adding @reproduce_failure('6.131.14', b'AEEAQQBBAEEAQQA=') as a decorator on your test case
  +------------------------------------
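Each sub-exception above ends with a reproduce_failure hint. That decorator makes Hypothesis replay the recorded choice sequence instead of generating fresh examples, so the failing parameters come back deterministically. A sketch of how sub-exception 1 would be pinned down locally (a trimmed reconstruction of the test, not the file as committed):

    import unittest
    from typing import Optional

    import torch
    from hypothesis import Verbosity, given, reproduce_failure, settings
    from hypothesis import strategies as st

    class ReproActivationTests(unittest.TestCase):
        # Version and blob are copied verbatim from sub-exception 1; Hypothesis
        # refuses to replay a blob recorded under a different library version.
        @reproduce_failure("6.131.14", b"AEECQQBBAEEAQQE=")
        @given(
            T=st.sampled_from([1, 128, 2048, 4096, 16384]),
            D=st.sampled_from([5120, 7168]),
            scale_ub=st.sampled_from([None, 1200.00]),
            contiguous=st.sampled_from([True, False]),
            compiled=st.sampled_from([True, False]),
        )
        @settings(verbosity=Verbosity.verbose, deadline=None)
        def test_silu_mul_quant_repro(
            self, T: int, D: int, scale_ub: Optional[float],
            contiguous: bool, compiled: bool,
        ) -> None:
            # Same first allocation as the real test; it only OOMs if the GPU
            # is in the same exhausted state as on the runner.
            x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
            del x

As the log says, the decorator is meant to be temporary: once the underlying failure is fixed, it comes off again.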
---------------------------------- Hypothesis ----------------------------------
Trying example: test_silu_mul_quant(
    self=,
    T=1,
    D=5120,
    scale_ub=None,
    contiguous=True,
    compiled=True,
)
(test source identical to the listing above, continuing past the fn() call:)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
(jit, autotuner, and do_bench frames identical to sub-exception 4 above)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
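This example shows the reference path dying the same way as the kernel under test: triton_quantize_fp8_row also JIT-compiles an fp8e4nv kernel, so on this GPU even the "reference" cannot run. For what ref_fn is computing, a pure-PyTorch row-wise FP8 quantization looks roughly like this (a sketch of the general recipe with names of my own; it is not FBGEMM's implementation and ignores that library's exact rounding and epsilon choices):

    import torch

    # E4M3 max value; equals torch.finfo(torch.float8_e4m3fn).max.
    E4M3_MAX = 448.0

    def quantize_fp8_row_ref(y: torch.Tensor, scale_ub_tensor=None):
        # Per-row scale so the largest magnitude in each row maps to E4M3_MAX.
        row_max = y.abs().amax(dim=-1).to(torch.float32)
        if scale_ub_tensor is not None:
            row_max = torch.minimum(row_max, scale_ub_tensor.to(row_max.device))
        scale = torch.clamp(row_max, min=1e-12) / E4M3_MAX
        y_fp8 = (y.to(torch.float32) / scale[:, None]).to(torch.float8_e4m3fn)
        return y_fp8, scale

A fallback along these lines also runs on CPU, which is one way to keep the reference comparison meaningful on GPUs without FP8 support.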
2025-05-07T20:33:37.1118208Z op = torch.compile(op) 2025-05-07T20:33:37.1118606Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1118979Z 2025-05-07T20:33:37.1119229Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.1119443Z 2025-05-07T20:33:37.1119585Z moe/activation_test.py:117: 2025-05-07T20:33:37.1119979Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1120433Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.1120803Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1121788Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.1122688Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.1123385Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.1124266Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.1125126Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.1125817Z kernel = self.compile( 2025-05-07T20:33:37.1126515Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.1127358Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.1127871Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1128179Z 2025-05-07T20:33:37.1128438Z self = 2025-05-07T20:33:37.1129840Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.1131698Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89f5bf2020>} 2025-05-07T20:33:37.1133462Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.1134805Z context = 2025-05-07T20:33:37.1135180Z 2025-05-07T20:33:37.1135400Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.1136074Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.1136738Z module_map=module_map) 2025-05-07T20:33:37.1137213Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.1137661Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.1137984Z E ^ 2025-05-07T20:33:37.1138584Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.1139274Z 2025-05-07T20:33:37.1139834Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.1140770Z 2025-05-07T20:33:37.1140920Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1141474Z self=, 2025-05-07T20:33:37.1142011Z T=2048, 2025-05-07T20:33:37.1142272Z D=5120, 2025-05-07T20:33:37.1142537Z scale_ub=1200.0, 2025-05-07T20:33:37.1173251Z contiguous=True, 2025-05-07T20:33:37.1173524Z compiled=True, 2025-05-07T20:33:37.1173751Z ) 2025-05-07T20:33:37.1174396Z self = 2025-05-07T20:33:37.1174898Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:37.1175179Z 2025-05-07T20:33:37.1175268Z @given( 2025-05-07T20:33:37.1175492Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1175816Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1176124Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1176444Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1176765Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1177041Z ) 2025-05-07T20:33:37.1177387Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1177920Z def test_silu_mul_quant( 2025-05-07T20:33:37.1178252Z self, 2025-05-07T20:33:37.1178493Z T: int, 2025-05-07T20:33:37.1178818Z D: int, 2025-05-07T20:33:37.1179333Z scale_ub: Optional[float], 2025-05-07T20:33:37.1179658Z contiguous: bool, 2025-05-07T20:33:37.1179988Z compiled: bool, 2025-05-07T20:33:37.1180385Z ) -> None: 2025-05-07T20:33:37.1194734Z torch.manual_seed(2025) 2025-05-07T20:33:37.1195002Z 2025-05-07T20:33:37.1195276Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1195641Z 2025-05-07T20:33:37.1195832Z x_sign = torch.sign(x) 2025-05-07T20:33:37.1196113Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.1196425Z x = x_sign * x_clamp 2025-05-07T20:33:37.1196665Z x0 = x[:, :D] 2025-05-07T20:33:37.1196873Z x1 = x[:, D:] 2025-05-07T20:33:37.1197076Z 2025-05-07T20:33:37.1197260Z if contiguous: 2025-05-07T20:33:37.1197491Z x0 = x0.contiguous() 2025-05-07T20:33:37.1197738Z x1 = x1.contiguous() 2025-05-07T20:33:37.1197976Z 2025-05-07T20:33:37.1198170Z if scale_ub is not None: 2025-05-07T20:33:37.1198434Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.1198766Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.1199076Z ) 2025-05-07T20:33:37.1199260Z else: 2025-05-07T20:33:37.1199464Z scale_ub_tensor = None 2025-05-07T20:33:37.1199714Z 2025-05-07T20:33:37.1199933Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.1200244Z op = silu_mul_quant 2025-05-07T20:33:37.1200486Z if compiled: 2025-05-07T20:33:37.1200724Z op = torch.compile(op) 2025-05-07T20:33:37.1201016Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1201288Z 2025-05-07T20:33:37.1201470Z y_fp8, y_scale = fn() 2025-05-07T20:33:37.1201753Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:37.1202036Z 2025-05-07T20:33:37.1202390Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.1202716Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:37.1203003Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:37.1203312Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:37.1203657Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:37.1204052Z 2025-05-07T20:33:37.1204251Z > 
y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0, ..., debug=False, backend_name='cuda', sanitize_overflow=True)

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(T=16384, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
(same jit.py:330 -> jit.py:623 -> compiler.py:273 frames as above, here with num_stages=3)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
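Both kernels die at the same point: Triton refuses to lower the fp8e4nv (float8_e4m3fn) element type on this runner's GPU (a g5.4xlarge carries an A10G, compute capability 8.6), so every FP8 code path in the test fails before launch. A minimal sketch of a capability gate that would skip these tests on such hardware follows; the (8, 9) threshold is an assumption inferred from the error above, and gpu_supports_fp8e4nv is a hypothetical helper, not FBGEMM's actual guard.

    import unittest

    import torch


    def gpu_supports_fp8e4nv() -> bool:
        # Assumed threshold: Triton lowers fp8e4nv only on compute capability
        # >= (8, 9), i.e. Ada/Hopper-class GPUs; the A10G here is (8, 6).
        if not torch.cuda.is_available():
            return False
        return torch.cuda.get_device_capability() >= (8, 9)


    # Hypothetical usage: gate the whole test class on FP8 support.
    @unittest.skipUnless(gpu_supports_fp8e4nv(), "fp8e4nv not supported on this GPU")
    class ActivationTests(unittest.TestCase):
        pass

With a gate of this shape the job would report the FP8 tests as skipped instead of re-deriving the same CompilationError for every Hypothesis example below.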
Trying example: test_silu_mul_quant(T=1, D=7168, scale_ub=None, contiguous=True, compiled=True)

(same test body as above; with compiled=True, fn() returns and the test proceeds)

        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
(same autotuner/jit/compiler frames as in the first traceback)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
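Under compiled=True the test gets past fn() and instead fails in ref_fn, which row-quantizes the fp32 SiLU-mul reference output via triton_quantize_fp8_row. As an illustration of what that row-wise quantization computes (one scale per row, such that y ≈ y_fp8.float() * y_scale[:, None], which is exactly how the test dequantizes), here is a rough eager sketch; the epsilon and clamping details are assumptions, not the exact semantics of FBGEMM's kernel.

    from typing import Optional, Tuple

    import torch

    FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3 ("fp8e4nv")


    def quantize_fp8_row_sketch(
        y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # One scale per row, derived from that row's max magnitude
        # (optionally capped by scale_ub, as in the test).
        row_max = y.abs().amax(dim=-1).float()
        if scale_ub is not None:
            row_max = torch.clamp(row_max, max=scale_ub.item())
        row_max = torch.clamp(row_max, min=1e-12)  # assumed eps; avoids 0-division
        y_scale = row_max / FP8_MAX
        y_fp8 = (y.float() / y_scale[:, None]).clamp(-FP8_MAX, FP8_MAX)
        return y_fp8.to(torch.float8_e4m3fn), y_scale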
Trying examples: the remaining attempts all reproduced the identical failure,
CompilationError: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"),
differing only in the drawn parameters and in which kernel Triton tried to compile first:

    T     D     scale_ub  contiguous  compiled  first kernel to fail
    ----  ----  --------  ----------  --------  -------------------------------------
    4096  5120  None      False       False     fn() -> _fbgemm_silu_mul_quant
    4096  7168  None      False       False     fn() -> _fbgemm_silu_mul_quant
    128   7168  None      False       True      ref_fn() -> _kernel_quantize_fp8_row
    128   7168  None      False       False     fn() -> _fbgemm_silu_mul_quant
    4096  5120  1200.0    True        False     fn() -> _fbgemm_silu_mul_quant
    1     5120  None      True        True      ref_fn() -> _kernel_quantize_fp8_row
    2048  5120  None      True        True      ref_fn() -> _kernel_quantize_fp8_row
    128   5120  None      True        True      ref_fn() -> _kernel_quantize_fp8_row

(Each attempt printed the full test body and traceback again. With compiled=False the
eager Triton launch inside silu_mul_quant fails at once; with compiled=True the test
gets past fn() and the same error surfaces in the eager reference quantizer instead.)
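Hypothesis keeps drawing from the same 5x2x2x2x2 grid (80 combinations), so the repeated attempts add no new information. For local debugging it can help to pin one shrunk failing case so a rerun hits it deterministically; below is a minimal sketch using the standard Hypothesis @example decorator, with a simplified, hypothetical test signature rather than the real one.

    from typing import Optional

    from hypothesis import example, given, settings, strategies as st


    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
    )
    @example(T=1, D=5120, scale_ub=None)  # smallest failing shape in this log
    @settings(max_examples=10, deadline=None)
    def test_fp8_shapes(T: int, D: int, scale_ub: Optional[float]) -> None:
        # Placeholder body; the real test would call silu_mul_quant here.
        assert T >= 1 and D in (5120, 7168)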
Hypothesis keeps drawing examples at this verbosity; each one hits the same Triton compile error, so only the parameters and the failing call path differ below.

Trying example: test_silu_mul_quant(self=<...>, T=4096, D=5120, scale_ub=None, contiguous=True, compiled=True)
  -> same CompilationError in ref_fn (moe/activation_test.py:126) via triton_quantize_fp8_row -> _kernel_quantize_fp8_row: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

Trying example: test_silu_mul_quant(self=<...>, T=16384, D=5120, scale_ub=None, contiguous=True, compiled=True)
  -> same CompilationError in ref_fn (moe/activation_test.py:126) via triton_quantize_fp8_row -> _kernel_quantize_fp8_row: ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
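For context, the rowwise quantization that _kernel_quantize_fp8_row performs can be written in eager PyTorch. This is a rough stand-in, not FBGEMM's kernel; the epsilon and clamping details are assumptions, but it shows the max-abs/scale/cast structure that the Triton kernel tries to compile for fp8e4nv:

from typing import Optional, Tuple

import torch

def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Rowwise max-abs scaling into float8_e4m3fn (Triton's "fp8e4nv").
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0
    row_max = y.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
    if scale_ub is not None:
        # Optional upper bound on the rowwise max, as in the scale_ub cases.
        row_max = torch.minimum(row_max, scale_ub)
    scale = row_max / fp8_max
    y_fp8 = (y / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return y_fp8, scale.squeeze(-1)

# Dequantize the same way the test does: y_fp8.to(torch.float32) * scale[:, None].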
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.1446157Z 2025-05-07T20:33:37.1446588Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.1446593Z 2025-05-07T20:33:37.1446692Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1446909Z self=, 2025-05-07T20:33:37.1446984Z T=1, 2025-05-07T20:33:37.1447055Z D=5120, 2025-05-07T20:33:37.1447128Z scale_ub=1200.0, 2025-05-07T20:33:37.1447208Z contiguous=True, 2025-05-07T20:33:37.1447282Z compiled=True, 2025-05-07T20:33:37.1447352Z ) 2025-05-07T20:33:37.1447566Z self = 2025-05-07T20:33:37.1447722Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:37.1447727Z 2025-05-07T20:33:37.1447800Z @given( 2025-05-07T20:33:37.1447910Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1448004Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1448115Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1448289Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1448401Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1448476Z ) 2025-05-07T20:33:37.1448721Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1448809Z def test_silu_mul_quant( 2025-05-07T20:33:37.1448920Z self, 2025-05-07T20:33:37.1448991Z T: int, 2025-05-07T20:33:37.1449062Z D: int, 2025-05-07T20:33:37.1449152Z scale_ub: Optional[float], 2025-05-07T20:33:37.1449234Z contiguous: bool, 2025-05-07T20:33:37.1449313Z compiled: bool, 2025-05-07T20:33:37.1449384Z ) -> None: 2025-05-07T20:33:37.1449468Z torch.manual_seed(2025) 2025-05-07T20:33:37.1449539Z 2025-05-07T20:33:37.1449698Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1449769Z 2025-05-07T20:33:37.1449855Z x_sign = torch.sign(x) 2025-05-07T20:33:37.1449976Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.1450103Z x = x_sign * x_clamp 2025-05-07T20:33:37.1450179Z x0 = x[:, :D] 2025-05-07T20:33:37.1450252Z x1 = x[:, D:] 2025-05-07T20:33:37.1450321Z 2025-05-07T20:33:37.1450398Z if contiguous: 2025-05-07T20:33:37.1450480Z x0 = x0.contiguous() 2025-05-07T20:33:37.1450566Z x1 = x1.contiguous() 2025-05-07T20:33:37.1450632Z 2025-05-07T20:33:37.1450713Z if scale_ub is not None: 2025-05-07T20:33:37.1450812Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.1450943Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.1451014Z ) 2025-05-07T20:33:37.1451090Z else: 2025-05-07T20:33:37.1451178Z scale_ub_tensor = None 2025-05-07T20:33:37.1451316Z 2025-05-07T20:33:37.1451441Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.1451524Z op = silu_mul_quant 2025-05-07T20:33:37.1451615Z if compiled: 2025-05-07T20:33:37.1451711Z op = torch.compile(op) 2025-05-07T20:33:37.1451809Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1451881Z 2025-05-07T20:33:37.1451965Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.1451969Z 2025-05-07T20:33:37.1452061Z moe/activation_test.py:117: 2025-05-07T20:33:37.1452188Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1452281Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.1452374Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1452752Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:37.1452840Z return fn(*args, **kwargs) 
2025-05-07T20:33:37.1453325Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.1453423Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.1453777Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.1453999Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.1454332Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.1454424Z kernel = self.compile( 2025-05-07T20:33:37.1454801Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.1454968Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.1455088Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1455095Z 2025-05-07T20:33:37.1455291Z self = 2025-05-07T20:33:37.1456102Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.1456624Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89cb8f9d00>} 2025-05-07T20:33:37.1457433Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.1457622Z context = 2025-05-07T20:33:37.1457626Z 2025-05-07T20:33:37.1457783Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.1458051Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.1468049Z module_map=module_map) 2025-05-07T20:33:37.1468302Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.1468402Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.1468473Z E ^ 2025-05-07T20:33:37.1468827Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.1468836Z 2025-05-07T20:33:37.1469259Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.1469264Z 2025-05-07T20:33:37.1469362Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1469580Z self=, 2025-05-07T20:33:37.1469654Z T=1, 2025-05-07T20:33:37.1469772Z D=5120, 2025-05-07T20:33:37.1469853Z scale_ub=None, 2025-05-07T20:33:37.1469934Z contiguous=False, 2025-05-07T20:33:37.1470012Z compiled=True, 2025-05-07T20:33:37.1470081Z ) 2025-05-07T20:33:37.1470295Z self = 2025-05-07T20:33:37.1470457Z T = 1, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:37.1470462Z 2025-05-07T20:33:37.1470534Z @given( 2025-05-07T20:33:37.1470647Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1470745Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1470853Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1470961Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1471069Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1471137Z ) 2025-05-07T20:33:37.1471375Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1471465Z def test_silu_mul_quant( 2025-05-07T20:33:37.1471536Z self, 2025-05-07T20:33:37.1471608Z T: int, 2025-05-07T20:33:37.1471678Z D: int, 2025-05-07T20:33:37.1471769Z scale_ub: Optional[float], 2025-05-07T20:33:37.1471855Z contiguous: bool, 2025-05-07T20:33:37.1471933Z compiled: bool, 2025-05-07T20:33:37.1472004Z ) -> None: 2025-05-07T20:33:37.1472094Z torch.manual_seed(2025) 2025-05-07T20:33:37.1472161Z 2025-05-07T20:33:37.1472323Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1472399Z 2025-05-07T20:33:37.1472485Z x_sign = torch.sign(x) 2025-05-07T20:33:37.1472607Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.1472691Z x = x_sign * x_clamp 2025-05-07T20:33:37.1472765Z x0 = x[:, :D] 2025-05-07T20:33:37.1472842Z x1 = x[:, D:] 2025-05-07T20:33:37.1472908Z 2025-05-07T20:33:37.1472983Z if contiguous: 2025-05-07T20:33:37.1473072Z x0 = x0.contiguous() 2025-05-07T20:33:37.1473153Z x1 = x1.contiguous() 2025-05-07T20:33:37.1473221Z 2025-05-07T20:33:37.1473355Z if scale_ub is not None: 2025-05-07T20:33:37.1473458Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.1473586Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.1473659Z ) 2025-05-07T20:33:37.1473732Z else: 2025-05-07T20:33:37.1473818Z scale_ub_tensor = None 2025-05-07T20:33:37.1473884Z 2025-05-07T20:33:37.1474049Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.1474136Z op = silu_mul_quant 2025-05-07T20:33:37.1474214Z if compiled: 2025-05-07T20:33:37.1474306Z op = torch.compile(op) 2025-05-07T20:33:37.1474404Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1474471Z 2025-05-07T20:33:37.1474555Z y_fp8, y_scale = fn() 2025-05-07T20:33:37.1474676Z y = y_fp8.to(torch.float32) * y_scale[:, None] 2025-05-07T20:33:37.1474741Z 2025-05-07T20:33:37.1474868Z def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.1474971Z x0_fp32 = x0.to(torch.float32) 2025-05-07T20:33:37.1475107Z x1_fp32 = x1.to(torch.float32) 2025-05-07T20:33:37.1475223Z y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32 2025-05-07T20:33:37.1475354Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:37.1475423Z 2025-05-07T20:33:37.1475525Z > y_fp8_ref, 
y_scale_ref = ref_fn() 2025-05-07T20:33:37.1475529Z 2025-05-07T20:33:37.1475620Z moe/activation_test.py:126: 2025-05-07T20:33:37.1475740Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1475841Z moe/activation_test.py:124: in ref_fn 2025-05-07T20:33:37.1475968Z return triton_quantize_fp8_row(y, scale_ub_tensor) 2025-05-07T20:33:37.1476521Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row 2025-05-07T20:33:37.1476659Z _kernel_quantize_fp8_row[grid]( 2025-05-07T20:33:37.1477018Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.1477234Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.1477594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run 2025-05-07T20:33:37.1477845Z timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs} 2025-05-07T20:33:37.1478234Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench 2025-05-07T20:33:37.1478397Z return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8)) 2025-05-07T20:33:37.1478733Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench 2025-05-07T20:33:37.1478803Z fn() 2025-05-07T20:33:37.1479203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call 2025-05-07T20:33:37.1479284Z self.fn.run( 2025-05-07T20:33:37.1479614Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.1479700Z kernel = self.compile( 2025-05-07T20:33:37.1480077Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.1480245Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.1480369Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1480373Z 2025-05-07T20:33:37.1480571Z self = 2025-05-07T20:33:37.1481378Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.1481876Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca706de0>} 2025-05-07T20:33:37.1482605Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.1482831Z context = 2025-05-07T20:33:37.1482835Z 2025-05-07T20:33:37.1482990Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.1483244Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.1483348Z module_map=module_map) 2025-05-07T20:33:37.1483506Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.1483605Z E def _kernel_quantize_fp8_row( 2025-05-07T20:33:37.1483676Z E ^ 2025-05-07T20:33:37.1484061Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.1484066Z 2025-05-07T20:33:37.1484477Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.1484487Z 2025-05-07T20:33:37.1484582Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1484797Z self=, 2025-05-07T20:33:37.1484869Z T=1, 2025-05-07T20:33:37.1484941Z D=5120, 2025-05-07T20:33:37.1485020Z scale_ub=None, 2025-05-07T20:33:37.1485101Z contiguous=True, 2025-05-07T20:33:37.1485176Z compiled=False, 2025-05-07T20:33:37.1485284Z ) 2025-05-07T20:33:37.1485493Z self = 2025-05-07T20:33:37.1485651Z T = 1, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:37.1485660Z 2025-05-07T20:33:37.1485735Z @given( 2025-05-07T20:33:37.1485846Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1485940Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1486047Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1486157Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1486266Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1486334Z ) 2025-05-07T20:33:37.1486569Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1486661Z def test_silu_mul_quant( 2025-05-07T20:33:37.1486732Z self, 2025-05-07T20:33:37.1486804Z T: int, 2025-05-07T20:33:37.1486883Z D: int, 2025-05-07T20:33:37.1486971Z scale_ub: Optional[float], 2025-05-07T20:33:37.1487055Z contiguous: bool, 2025-05-07T20:33:37.1487135Z compiled: bool, 2025-05-07T20:33:37.1487209Z ) -> None: 2025-05-07T20:33:37.1487302Z torch.manual_seed(2025) 2025-05-07T20:33:37.1487366Z 2025-05-07T20:33:37.1487526Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1487596Z 2025-05-07T20:33:37.1487679Z x_sign = torch.sign(x) 2025-05-07T20:33:37.1487796Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.1487884Z x = x_sign * x_clamp 2025-05-07T20:33:37.1487957Z x0 = x[:, :D] 2025-05-07T20:33:37.1488029Z x1 = x[:, D:] 2025-05-07T20:33:37.1488096Z 2025-05-07T20:33:37.1488173Z if contiguous: 2025-05-07T20:33:37.1488263Z x0 = x0.contiguous() 2025-05-07T20:33:37.1488346Z x1 = x1.contiguous() 2025-05-07T20:33:37.1488411Z 2025-05-07T20:33:37.1488499Z if scale_ub is not None: 2025-05-07T20:33:37.1488596Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.1488773Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.1488851Z ) 2025-05-07T20:33:37.1488925Z else: 2025-05-07T20:33:37.1489012Z scale_ub_tensor = None 2025-05-07T20:33:37.1489081Z 2025-05-07T20:33:37.1489202Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.1489287Z op = silu_mul_quant 2025-05-07T20:33:37.1489406Z if compiled: 2025-05-07T20:33:37.1489501Z op = torch.compile(op) 2025-05-07T20:33:37.1489601Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1489668Z 2025-05-07T20:33:37.1489751Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.1489755Z 2025-05-07T20:33:37.1489850Z moe/activation_test.py:117: 2025-05-07T20:33:37.1489970Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1490066Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.1490169Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1490800Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.1490892Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.1491248Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.1491462Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.1491795Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.1491882Z kernel = self.compile( 2025-05-07T20:33:37.1492256Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.1492425Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.1492584Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1492589Z 2025-05-07T20:33:37.1492789Z self = 2025-05-07T20:33:37.1493551Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.1494042Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89cb4be700>} 2025-05-07T20:33:37.1494772Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.1494959Z context = 2025-05-07T20:33:37.1494964Z 2025-05-07T20:33:37.1495123Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.1495389Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.1495489Z module_map=module_map) 2025-05-07T20:33:37.1495646Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.1495737Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.1495817Z E ^ 2025-05-07T20:33:37.1496163Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.1496168Z 2025-05-07T20:33:37.1496569Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.1496574Z 2025-05-07T20:33:37.1496672Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1496887Z self=, 2025-05-07T20:33:37.1496963Z T=128, 2025-05-07T20:33:37.1497077Z D=5120, 2025-05-07T20:33:37.1497152Z scale_ub=None, 2025-05-07T20:33:37.1497238Z contiguous=False, 2025-05-07T20:33:37.1497311Z compiled=True, 2025-05-07T20:33:37.1497377Z ) 2025-05-07T20:33:37.1497589Z self = 2025-05-07T20:33:37.1497751Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:37.1497794Z 2025-05-07T20:33:37.1497866Z @given( 2025-05-07T20:33:37.1497983Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1498076Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1498183Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1498296Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1498403Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1498481Z ) 2025-05-07T20:33:37.1498717Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1498808Z def test_silu_mul_quant( 2025-05-07T20:33:37.1498882Z self, 2025-05-07T20:33:37.1498992Z T: int, 2025-05-07T20:33:37.1499062Z D: int, 2025-05-07T20:33:37.1499159Z scale_ub: Optional[float], 2025-05-07T20:33:37.1499242Z contiguous: bool, 2025-05-07T20:33:37.1499321Z compiled: bool, 2025-05-07T20:33:37.1499400Z ) -> None: 2025-05-07T20:33:37.1499487Z torch.manual_seed(2025) 2025-05-07T20:33:37.1499555Z 2025-05-07T20:33:37.1499721Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1499794Z 2025-05-07T20:33:37.1499882Z x_sign = torch.sign(x) 2025-05-07T20:33:37.1499997Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.1500078Z x = x_sign * x_clamp 2025-05-07T20:33:37.1500201Z x0 = x[:, :D] 2025-05-07T20:33:37.1500275Z x1 = x[:, D:] 2025-05-07T20:33:37.1500343Z 2025-05-07T20:33:37.1500422Z if contiguous: 2025-05-07T20:33:37.1500509Z x0 = x0.contiguous() 2025-05-07T20:33:37.1500592Z x1 = x1.contiguous() 2025-05-07T20:33:37.1500661Z 2025-05-07T20:33:37.1500746Z if scale_ub is not None: 2025-05-07T20:33:37.1500841Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.1500972Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.1501045Z ) 2025-05-07T20:33:37.1501118Z else: 2025-05-07T20:33:37.1501207Z scale_ub_tensor = None 2025-05-07T20:33:37.1501273Z 2025-05-07T20:33:37.1501398Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.1501481Z op = silu_mul_quant 2025-05-07T20:33:37.1501560Z if compiled: 2025-05-07T20:33:37.1501656Z op = torch.compile(op) 2025-05-07T20:33:37.1501758Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1501824Z 2025-05-07T20:33:37.1501913Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.1501917Z 2025-05-07T20:33:37.1502009Z moe/activation_test.py:117: 2025-05-07T20:33:37.1502136Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1502227Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.1502317Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1502680Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:37.1502767Z return fn(*args, **kwargs) 
2025-05-07T20:33:37.1503249Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.1503343Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.1503693Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.1503915Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.1504293Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.1504383Z kernel = self.compile( 2025-05-07T20:33:37.1504758Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.1504924Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.1505114Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1505119Z 2025-05-07T20:33:37.1505318Z self = 2025-05-07T20:33:37.1506076Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.1506575Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca707880>} 2025-05-07T20:33:37.1507344Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.1507589Z context = 2025-05-07T20:33:37.1507597Z 2025-05-07T20:33:37.1507751Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.1508004Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.1508107Z module_map=module_map) 2025-05-07T20:33:37.1508261Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.1508394Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.1508469Z E ^ 2025-05-07T20:33:37.1508816Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.1508823Z 2025-05-07T20:33:37.1509233Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.1509238Z 2025-05-07T20:33:37.1509332Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1509549Z self=, 2025-05-07T20:33:37.1509624Z T=128, 2025-05-07T20:33:37.1509693Z D=7168, 2025-05-07T20:33:37.1509772Z scale_ub=1200.0, 2025-05-07T20:33:37.1509852Z contiguous=False, 2025-05-07T20:33:37.1509927Z compiled=False, 2025-05-07T20:33:37.1509998Z ) 2025-05-07T20:33:37.1510207Z self = 2025-05-07T20:33:37.1510376Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:37.1510381Z 2025-05-07T20:33:37.1510457Z @given( 2025-05-07T20:33:37.1510570Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1510665Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1510776Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1510885Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1510994Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1511083Z ) 2025-05-07T20:33:37.1511346Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1511434Z def test_silu_mul_quant( 2025-05-07T20:33:37.1511505Z self, 2025-05-07T20:33:37.1511575Z T: int, 2025-05-07T20:33:37.1511648Z D: int, 2025-05-07T20:33:37.1511737Z scale_ub: Optional[float], 2025-05-07T20:33:37.1511817Z contiguous: bool, 2025-05-07T20:33:37.1511905Z compiled: bool, 2025-05-07T20:33:37.1511975Z ) -> None: 2025-05-07T20:33:37.1512061Z torch.manual_seed(2025) 2025-05-07T20:33:37.1512176Z 2025-05-07T20:33:37.1512339Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1512407Z 2025-05-07T20:33:37.1512492Z x_sign = torch.sign(x) 2025-05-07T20:33:37.1512608Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.1512693Z x = x_sign * x_clamp 2025-05-07T20:33:37.1512805Z x0 = x[:, :D] 2025-05-07T20:33:37.1512875Z x1 = x[:, D:] 2025-05-07T20:33:37.1512949Z 2025-05-07T20:33:37.1513028Z if contiguous: 2025-05-07T20:33:37.1513112Z x0 = x0.contiguous() 2025-05-07T20:33:37.1513197Z x1 = x1.contiguous() 2025-05-07T20:33:37.1513265Z 2025-05-07T20:33:37.1513348Z if scale_ub is not None: 2025-05-07T20:33:37.1513451Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.1513583Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.1513655Z ) 2025-05-07T20:33:37.1513730Z else: 2025-05-07T20:33:37.1513822Z scale_ub_tensor = None 2025-05-07T20:33:37.1513894Z 2025-05-07T20:33:37.1514056Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.1514142Z op = silu_mul_quant 2025-05-07T20:33:37.1514227Z if compiled: 2025-05-07T20:33:37.1514320Z op = torch.compile(op) 2025-05-07T20:33:37.1514424Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1514495Z 2025-05-07T20:33:37.1514581Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.1514585Z 2025-05-07T20:33:37.1514675Z moe/activation_test.py:117: 2025-05-07T20:33:37.1514797Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1514891Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.1514987Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1515512Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.1515605Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.1515963Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.1516177Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.1516511Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.1516603Z kernel = self.compile( 2025-05-07T20:33:37.1516979Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.1517147Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.1517266Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1517273Z 2025-05-07T20:33:37.1517467Z self = 2025-05-07T20:33:37.1518232Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.1518721Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca98c7c0>} 2025-05-07T20:33:37.1519455Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.1519638Z context = 2025-05-07T20:33:37.1519643Z 2025-05-07T20:33:37.1519801Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.1520093Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.1520196Z module_map=module_map) 2025-05-07T20:33:37.1520357Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.1520464Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.1520547Z E ^ 2025-05-07T20:33:37.1520919Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.1520984Z 2025-05-07T20:33:37.1521412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.1521417Z 2025-05-07T20:33:37.1521517Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1521732Z self=, 2025-05-07T20:33:37.1521808Z T=128, 2025-05-07T20:33:37.1521882Z D=5120, 2025-05-07T20:33:37.1521958Z scale_ub=None, 2025-05-07T20:33:37.1522039Z contiguous=False, 2025-05-07T20:33:37.1522123Z compiled=False, 2025-05-07T20:33:37.1522191Z ) 2025-05-07T20:33:37.1522440Z self = 2025-05-07T20:33:37.1522607Z T = 128, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:37.1522612Z 2025-05-07T20:33:37.1522683Z @given( 2025-05-07T20:33:37.1522796Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1522891Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1522997Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1523113Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1523219Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1523289Z ) 2025-05-07T20:33:37.1523525Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1523655Z def test_silu_mul_quant( 2025-05-07T20:33:37.1523727Z self, 2025-05-07T20:33:37.1523802Z T: int, 2025-05-07T20:33:37.1523871Z D: int, 2025-05-07T20:33:37.1523968Z scale_ub: Optional[float], 2025-05-07T20:33:37.1524050Z contiguous: bool, 2025-05-07T20:33:37.1524127Z compiled: bool, 2025-05-07T20:33:37.1524202Z ) -> None: 2025-05-07T20:33:37.1524290Z torch.manual_seed(2025) 2025-05-07T20:33:37.1524357Z 2025-05-07T20:33:37.1524525Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1524597Z 2025-05-07T20:33:37.1524681Z x_sign = torch.sign(x) 2025-05-07T20:33:37.1524801Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.1524882Z x = x_sign * x_clamp 2025-05-07T20:33:37.1524955Z x0 = x[:, :D] 2025-05-07T20:33:37.1525033Z x1 = x[:, D:] 2025-05-07T20:33:37.1525101Z 2025-05-07T20:33:37.1525190Z if contiguous: 2025-05-07T20:33:37.1525274Z x0 = x0.contiguous() 2025-05-07T20:33:37.1525356Z x1 = x1.contiguous() 2025-05-07T20:33:37.1525425Z 2025-05-07T20:33:37.1525507Z if scale_ub is not None: 2025-05-07T20:33:37.1525607Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.1525735Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.1525806Z ) 2025-05-07T20:33:37.1525875Z else: 2025-05-07T20:33:37.1525971Z scale_ub_tensor = None 2025-05-07T20:33:37.1526042Z 2025-05-07T20:33:37.1526167Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.1526259Z op = silu_mul_quant 2025-05-07T20:33:37.1526339Z if compiled: 2025-05-07T20:33:37.1526435Z op = torch.compile(op) 2025-05-07T20:33:37.1526539Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1526606Z 2025-05-07T20:33:37.1526696Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.1526703Z 2025-05-07T20:33:37.1526794Z moe/activation_test.py:117: 2025-05-07T20:33:37.1526963Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1527061Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.1527157Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1527646Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.1527742Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.1528135Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.1528355Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.1528690Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.1528777Z kernel = self.compile( 2025-05-07T20:33:37.1529183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.1529352Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.1529516Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1529521Z 2025-05-07T20:33:37.1529717Z self = 2025-05-07T20:33:37.1530473Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.1530974Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89cb8fa7a0>} 2025-05-07T20:33:37.1531710Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.1531941Z context = 2025-05-07T20:33:37.1531946Z 2025-05-07T20:33:37.1532103Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.1532358Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.1532469Z module_map=module_map) 2025-05-07T20:33:37.1532627Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.1532730Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.1532804Z E ^ 2025-05-07T20:33:37.1533151Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.1533156Z 2025-05-07T20:33:37.1533594Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.1533599Z 2025-05-07T20:33:37.1533701Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1533926Z self=, 2025-05-07T20:33:37.1534000Z T=128, 2025-05-07T20:33:37.1534075Z D=5120, 2025-05-07T20:33:37.1534162Z scale_ub=1200.0, 2025-05-07T20:33:37.1534242Z contiguous=True, 2025-05-07T20:33:37.1534321Z compiled=False, 2025-05-07T20:33:37.1534394Z ) 2025-05-07T20:33:37.1534604Z self = 2025-05-07T20:33:37.1534766Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:37.1534771Z 2025-05-07T20:33:37.1534846Z @given( 2025-05-07T20:33:37.1534958Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1535053Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1535168Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1535283Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1535439Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1535513Z ) 2025-05-07T20:33:37.1535754Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1535850Z def test_silu_mul_quant( 2025-05-07T20:33:37.1535921Z self, 2025-05-07T20:33:37.1535995Z T: int, 2025-05-07T20:33:37.1536076Z D: int, 2025-05-07T20:33:37.1536208Z scale_ub: Optional[float], 2025-05-07T20:33:37.1536293Z contiguous: bool, 2025-05-07T20:33:37.1536381Z compiled: bool, 2025-05-07T20:33:37.1536455Z ) -> None: 2025-05-07T20:33:37.1536544Z torch.manual_seed(2025) 2025-05-07T20:33:37.1536619Z 2025-05-07T20:33:37.1536779Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1536853Z 2025-05-07T20:33:37.1536944Z x_sign = torch.sign(x) 2025-05-07T20:33:37.1537063Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.1537154Z x = x_sign * x_clamp 2025-05-07T20:33:37.1537228Z x0 = x[:, :D] 2025-05-07T20:33:37.1537338Z x1 = x[:, D:] 2025-05-07T20:33:37.1537412Z 2025-05-07T20:33:37.1537492Z if contiguous: 2025-05-07T20:33:37.1537575Z x0 = x0.contiguous() 2025-05-07T20:33:37.1537664Z x1 = x1.contiguous() 2025-05-07T20:33:37.1537733Z 2025-05-07T20:33:37.1537820Z if scale_ub is not None: 2025-05-07T20:33:37.1537925Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.1538052Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.1538129Z ) 2025-05-07T20:33:37.1538202Z else: 2025-05-07T20:33:37.1538288Z scale_ub_tensor = None 2025-05-07T20:33:37.1538360Z 2025-05-07T20:33:37.1538483Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.1538610Z op = silu_mul_quant 2025-05-07T20:33:37.1538697Z if compiled: 2025-05-07T20:33:37.1538796Z op = torch.compile(op) 2025-05-07T20:33:37.1538898Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1538977Z 2025-05-07T20:33:37.1539064Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.1539069Z 2025-05-07T20:33:37.1539162Z moe/activation_test.py:117: 2025-05-07T20:33:37.1539291Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1539391Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.1539493Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1539985Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.1540278Z 
_fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f89ca480c20>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = ...

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError

Trying example: test_silu_mul_quant(
    self=...,
    T=1,
    D=7168,
    scale_ub=1200.0,
    contiguous=True,
    compiled=True,
)
self = ...
T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True

    @given(
        T=st.sampled_from([1, 128, 2048, 4096, 16384]),
        D=st.sampled_from([5120, 7168]),
        scale_ub=st.sampled_from([None, 1200.00]),
        contiguous=st.sampled_from([True, False]),
        compiled=st.sampled_from([True, False]),
    )
    @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
    def test_silu_mul_quant(
        self,
        T: int,
        D: int,
        scale_ub: Optional[float],
        contiguous: bool,
        compiled: bool,
    ) -> None:
        torch.manual_seed(2025)

        x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)

        x_sign = torch.sign(x)
        x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
        x = x_sign * x_clamp
        x0 = x[:, :D]
        x1 = x[:, D:]

        if contiguous:
            x0 = x0.contiguous()
            x1 = x1.contiguous()

        if scale_ub is not None:
            scale_ub_tensor = torch.tensor(
                [scale_ub], device="cuda", dtype=torch.float32
            )
        else:
            scale_ub_tensor = None

        def fn() -> Tuple[torch.Tensor, torch.Tensor]:
            op = silu_mul_quant
            if compiled:
                op = torch.compile(op)
            return op(x0, x1, scale_ub_tensor)

>       y_fp8, y_scale = fn()

moe/activation_test.py:117:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:115: in fn
    return op(x0, x1, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
    return fn(*args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
    _fbgemm_silu_mul_quant[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f89ca481ee0>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = ...

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _fbgemm_silu_mul_quant(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
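[editor's note] The failure looks environmental rather than a logic bug: Triton can only lower fp8e4nv (PyTorch's torch.float8_e4m3fn) on NVIDIA parts with compute capability 8.9 or newer, and the GPU on this runner appears to be an A10G (compute capability 8.6), where Triton offers only 'fp8e4b15' and 'fp8e5'. A minimal sketch of a capability guard such a test could use follows; the helper, the class name, and the 8.9 threshold are assumptions inferred from this log, not FBGEMM code.

import unittest

import torch


def supports_fp8e4nv() -> bool:
    # Hypothetical helper: Triton lowers fp8e4nv only on SM 8.9+ parts
    # (e.g. L4/L40S/H100). The A10G here is SM 8.6, which is why the
    # compiler offers only 'fp8e4b15' and 'fp8e5'.
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)


@unittest.skipUnless(supports_fp8e4nv(), "Triton fp8e4nv requires SM 8.9+")
class ActivationTests(unittest.TestCase):  # class name assumed from the log paths
    ...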
Trying example: test_silu_mul_quant(self=..., T=1, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
[test body and traceback identical to the example above: CompilationError from _fbgemm_silu_mul_quant, ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")]
Trying example: test_silu_mul_quant(
    self=...,
    T=1,
    D=7168,
    scale_ub=None,
    contiguous=False,
    compiled=True,
)
self = ...
T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = True

    [test body as above; this example gets past the first kernel launch and fails in the reference path instead]
        y_fp8, y_scale = fn()
        y = y_fp8.to(torch.float32) * y_scale[:, None]

        def ref_fn() -> Tuple[torch.Tensor, torch.Tensor]:
            x0_fp32 = x0.to(torch.float32)
            x1_fp32 = x1.to(torch.float32)
            y = x0_fp32 * torch.sigmoid(x0_fp32) * x1_fp32
            return triton_quantize_fp8_row(y, scale_ub_tensor)

>       y_fp8_ref, y_scale_ref = ref_fn()

moe/activation_test.py:126:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
moe/activation_test.py:124: in ref_fn
    return triton_quantize_fp8_row(y, scale_ub_tensor)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gemm/triton_gemm/fp8_gemm.py:2370: in triton_quantize_fp8_row
    _kernel_quantize_fp8_row[grid](
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:186: in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:166: in _bench
    return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/testing.py:117: in do_bench
    fn()
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:152: in kernel_call
    self.fn.run(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
    kernel = self.compile(
/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
    module = src.make_ir(options, codegen_fns, module_map, context)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <triton.compiler.compiler.ASTSource object at 0x...>
options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=2, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
codegen_fns = {'convert_custom_types': <function ...>, 'min_dot_size': <function ... at 0x7f89ca6f4180>}
module_map = {'triton.language.extra.libdevice': <module ...>}
context = ...

    def make_ir(self, options, codegen_fns, module_map, context):
>       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
                           module_map=module_map)
E       triton.compiler.errors.CompilationError: at 1:0:
E       def _kernel_quantize_fp8_row(
E       ^
E       ValueError("type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")

/home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
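[editor's note] The reference path dies the same way: triton_quantize_fp8_row autotunes _kernel_quantize_fp8_row, which also materializes fp8e4nv. For readers following along, here is a hedged pure-PyTorch sketch of what ref_fn computes, assuming 448.0 (the finite max of torch.float8_e4m3fn) as the fp8 bound; the exact scale and eps handling inside fbgemm's Triton kernel may differ.

from typing import Optional, Tuple

import torch

FP8_E4M3_MAX = 448.0  # finite max of torch.float8_e4m3fn


def silu_mul_ref(x0: torch.Tensor, x1: torch.Tensor) -> torch.Tensor:
    # Mirrors ref_fn above: SiLU(x0) * x1, computed in fp32.
    x0_fp32 = x0.to(torch.float32)
    return x0_fp32 * torch.sigmoid(x0_fp32) * x1.to(torch.float32)


def quantize_fp8_row_ref(
    y: torch.Tensor, scale_ub: Optional[torch.Tensor] = None
) -> Tuple[torch.Tensor, torch.Tensor]:
    # Row-wise quantization sketch: one scale per row, chosen so that
    # y ~= y_fp8.to(torch.float32) * scale[:, None], matching how the test
    # dequantizes. fbgemm's _kernel_quantize_fp8_row may clamp differently.
    row_max = y.abs().amax(dim=-1)
    if scale_ub is not None:
        row_max = torch.minimum(row_max, scale_ub)
    scale = (row_max / FP8_E4M3_MAX).clamp(min=1e-12)
    y_fp8 = (y / scale[:, None]).to(torch.float8_e4m3fn)
    return y_fp8, scale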
Trying example: test_silu_mul_quant(self=..., T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(self=..., T=1, D=5120, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=..., T=16384, D=5120, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(self=..., T=2048, D=7168, scale_ub=1200.0, contiguous=False, compiled=True)
Trying example: test_silu_mul_quant(self=..., T=1, D=5120, scale_ub=None, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=..., T=4096, D=7168, scale_ub=1200.0, contiguous=False, compiled=False)
Trying example: test_silu_mul_quant(self=..., T=16384, D=7168, scale_ub=None, contiguous=True, compiled=True)
Trying example: test_silu_mul_quant(self=..., T=4096, D=5120, scale_ub=None, contiguous=False, compiled=True)
[each of these examples fails in _fbgemm_silu_mul_quant with the same CompilationError as above; test bodies and tracebacks elided]
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.1690451Z 2025-05-07T20:33:37.1690860Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.1690903Z 2025-05-07T20:33:37.1690997Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1691213Z self=, 2025-05-07T20:33:37.1691289Z T=4096, 2025-05-07T20:33:37.1691361Z D=5120, 2025-05-07T20:33:37.1691440Z scale_ub=1200.0, 2025-05-07T20:33:37.1691520Z contiguous=False, 2025-05-07T20:33:37.1691596Z compiled=False, 2025-05-07T20:33:37.1691663Z ) 2025-05-07T20:33:37.1691873Z self = 2025-05-07T20:33:37.1692045Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:37.1692049Z 2025-05-07T20:33:37.1692123Z @given( 2025-05-07T20:33:37.1692234Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1692324Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1692438Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1692552Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1692661Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1692729Z ) 2025-05-07T20:33:37.1692969Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1693060Z def test_silu_mul_quant( 2025-05-07T20:33:37.1693129Z self, 2025-05-07T20:33:37.1693200Z T: int, 2025-05-07T20:33:37.1693272Z D: int, 2025-05-07T20:33:37.1693360Z scale_ub: Optional[float], 2025-05-07T20:33:37.1693446Z contiguous: bool, 2025-05-07T20:33:37.1693525Z compiled: bool, 2025-05-07T20:33:37.1693595Z ) -> None: 2025-05-07T20:33:37.1693684Z torch.manual_seed(2025) 2025-05-07T20:33:37.1693750Z 2025-05-07T20:33:37.1693910Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1693976Z 2025-05-07T20:33:37.1694064Z x_sign = torch.sign(x) 2025-05-07T20:33:37.1694183Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.1694268Z x = x_sign * x_clamp 2025-05-07T20:33:37.1694342Z x0 = x[:, :D] 2025-05-07T20:33:37.1694457Z x1 = x[:, D:] 2025-05-07T20:33:37.1694530Z 2025-05-07T20:33:37.1694609Z if contiguous: 2025-05-07T20:33:37.1694694Z x0 = x0.contiguous() 2025-05-07T20:33:37.1694781Z x1 = x1.contiguous() 2025-05-07T20:33:37.1694849Z 2025-05-07T20:33:37.1694930Z if scale_ub is not None: 2025-05-07T20:33:37.1695031Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.1695198Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.1695266Z ) 2025-05-07T20:33:37.1695343Z else: 2025-05-07T20:33:37.1695431Z scale_ub_tensor = None 2025-05-07T20:33:37.1695499Z 2025-05-07T20:33:37.1695623Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.1695708Z op = silu_mul_quant 2025-05-07T20:33:37.1695797Z if compiled: 2025-05-07T20:33:37.1695887Z op = torch.compile(op) 2025-05-07T20:33:37.1695988Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1696055Z 2025-05-07T20:33:37.1696180Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.1696184Z 2025-05-07T20:33:37.1696275Z moe/activation_test.py:117: 2025-05-07T20:33:37.1696403Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1696497Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.1696597Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1697085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:37.1697175Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.1697533Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.1697815Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.1698153Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.1698243Z kernel = self.compile( 2025-05-07T20:33:37.1698625Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.1698796Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.1698917Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1698924Z 2025-05-07T20:33:37.1699119Z self = 2025-05-07T20:33:37.1699881Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.1700376Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca2556c0>} 2025-05-07T20:33:37.1701111Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.1701296Z context = 2025-05-07T20:33:37.1701303Z 2025-05-07T20:33:37.1701464Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.1701719Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.1705119Z module_map=module_map) 2025-05-07T20:33:37.1705312Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.1705414Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.1705494Z E ^ 2025-05-07T20:33:37.1705985Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.1705990Z 2025-05-07T20:33:37.1706409Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.1706413Z 2025-05-07T20:33:37.1706514Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1706733Z self=, 2025-05-07T20:33:37.1706846Z T=4096, 2025-05-07T20:33:37.1706920Z D=5120, 2025-05-07T20:33:37.1706999Z scale_ub=1200.0, 2025-05-07T20:33:37.1707078Z contiguous=False, 2025-05-07T20:33:37.1707160Z compiled=True, 2025-05-07T20:33:37.1707227Z ) 2025-05-07T20:33:37.1707501Z self = 2025-05-07T20:33:37.1707675Z T = 4096, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:37.1707682Z 2025-05-07T20:33:37.1707756Z @given( 2025-05-07T20:33:37.1707875Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1707969Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1708126Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1708243Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1708349Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1708415Z ) 2025-05-07T20:33:37.1708661Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1708749Z def test_silu_mul_quant( 2025-05-07T20:33:37.1708818Z self, 2025-05-07T20:33:37.1708891Z T: int, 2025-05-07T20:33:37.1708963Z D: int, 2025-05-07T20:33:37.1709057Z scale_ub: Optional[float], 2025-05-07T20:33:37.1709140Z contiguous: bool, 2025-05-07T20:33:37.1709218Z compiled: bool, 2025-05-07T20:33:37.1709338Z ) -> None: 2025-05-07T20:33:37.1709429Z torch.manual_seed(2025) 2025-05-07T20:33:37.1709499Z 2025-05-07T20:33:37.1709669Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1709737Z 2025-05-07T20:33:37.1709828Z x_sign = torch.sign(x) 2025-05-07T20:33:37.1709953Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.1710039Z x = x_sign * x_clamp 2025-05-07T20:33:37.1710116Z x0 = x[:, :D] 2025-05-07T20:33:37.1710197Z x1 = x[:, D:] 2025-05-07T20:33:37.1710268Z 2025-05-07T20:33:37.1710347Z if contiguous: 2025-05-07T20:33:37.1710435Z x0 = x0.contiguous() 2025-05-07T20:33:37.1710517Z x1 = x1.contiguous() 2025-05-07T20:33:37.1710591Z 2025-05-07T20:33:37.1710675Z if scale_ub is not None: 2025-05-07T20:33:37.1710775Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.1710908Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.1710985Z ) 2025-05-07T20:33:37.1711056Z else: 2025-05-07T20:33:37.1711146Z scale_ub_tensor = None 2025-05-07T20:33:37.1711215Z 2025-05-07T20:33:37.1711339Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.1711428Z op = silu_mul_quant 2025-05-07T20:33:37.1711510Z if compiled: 2025-05-07T20:33:37.1711605Z op = torch.compile(op) 2025-05-07T20:33:37.1711708Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1711779Z 2025-05-07T20:33:37.1711869Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.1711874Z 2025-05-07T20:33:37.1711966Z moe/activation_test.py:117: 2025-05-07T20:33:37.1712091Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1712190Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.1712287Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1712653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:37.1712747Z return fn(*args, **kwargs) 
2025-05-07T20:33:37.1713280Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.1713380Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.1713734Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.1713950Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.1714899Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.1714992Z kernel = self.compile( 2025-05-07T20:33:37.1715371Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.1715545Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.1715671Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1715676Z 2025-05-07T20:33:37.1715879Z self = 2025-05-07T20:33:37.1716682Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.1717183Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca256fc0>} 2025-05-07T20:33:37.1717914Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.1718136Z context = 2025-05-07T20:33:37.1718141Z 2025-05-07T20:33:37.1718307Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.1718566Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.1718671Z module_map=module_map) 2025-05-07T20:33:37.1718826Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.1718920Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.1718999Z E ^ 2025-05-07T20:33:37.1719346Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.1719351Z 2025-05-07T20:33:37.1719761Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.1719769Z 2025-05-07T20:33:37.1719870Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1720090Z self=, 2025-05-07T20:33:37.1720176Z T=2048, 2025-05-07T20:33:37.1720272Z D=7168, 2025-05-07T20:33:37.1720356Z scale_ub=1200.0, 2025-05-07T20:33:37.1720464Z contiguous=False, 2025-05-07T20:33:37.1720543Z compiled=False, 2025-05-07T20:33:37.1720612Z ) 2025-05-07T20:33:37.1720827Z self = 2025-05-07T20:33:37.1720997Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:37.1721005Z 2025-05-07T20:33:37.1721081Z @given( 2025-05-07T20:33:37.1721196Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1721290Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1721400Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1721510Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1721618Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1721690Z ) 2025-05-07T20:33:37.1721930Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1722064Z def test_silu_mul_quant( 2025-05-07T20:33:37.1722143Z self, 2025-05-07T20:33:37.1722221Z T: int, 2025-05-07T20:33:37.1722291Z D: int, 2025-05-07T20:33:37.1722384Z scale_ub: Optional[float], 2025-05-07T20:33:37.1722469Z contiguous: bool, 2025-05-07T20:33:37.1722553Z compiled: bool, 2025-05-07T20:33:37.1722670Z ) -> None: 2025-05-07T20:33:37.1722761Z torch.manual_seed(2025) 2025-05-07T20:33:37.1722830Z 2025-05-07T20:33:37.1722994Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1723062Z 2025-05-07T20:33:37.1723153Z x_sign = torch.sign(x) 2025-05-07T20:33:37.1723270Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.1723354Z x = x_sign * x_clamp 2025-05-07T20:33:37.1723442Z x0 = x[:, :D] 2025-05-07T20:33:37.1723518Z x1 = x[:, D:] 2025-05-07T20:33:37.1723587Z 2025-05-07T20:33:37.1723673Z if contiguous: 2025-05-07T20:33:37.1723761Z x0 = x0.contiguous() 2025-05-07T20:33:37.1723886Z x1 = x1.contiguous() 2025-05-07T20:33:37.1723959Z 2025-05-07T20:33:37.1724045Z if scale_ub is not None: 2025-05-07T20:33:37.1724150Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.1724279Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.1724356Z ) 2025-05-07T20:33:37.1724433Z else: 2025-05-07T20:33:37.1724522Z scale_ub_tensor = None 2025-05-07T20:33:37.1724592Z 2025-05-07T20:33:37.1724720Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.1724804Z op = silu_mul_quant 2025-05-07T20:33:37.1724886Z if compiled: 2025-05-07T20:33:37.1724984Z op = torch.compile(op) 2025-05-07T20:33:37.1725130Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1725200Z 2025-05-07T20:33:37.1725291Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.1725298Z 2025-05-07T20:33:37.1725389Z moe/activation_test.py:117: 2025-05-07T20:33:37.1725518Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1725613Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.1725708Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1726204Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:37.1726298Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.1726653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.1726871Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.1727208Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.1727304Z kernel = self.compile( 2025-05-07T20:33:37.1727701Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.1727869Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.1727992Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1727997Z 2025-05-07T20:33:37.1728197Z self = 2025-05-07T20:33:37.1728961Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.1729453Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca257ec0>} 2025-05-07T20:33:37.1730546Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.1730729Z context = 2025-05-07T20:33:37.1730734Z 2025-05-07T20:33:37.1730891Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.1731188Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.1731292Z module_map=module_map) 2025-05-07T20:33:37.1731448Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.1731544Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.1731618Z E ^ 2025-05-07T20:33:37.1731966Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.1731974Z 2025-05-07T20:33:37.1732404Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.1732471Z 2025-05-07T20:33:37.1732570Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1732788Z self=, 2025-05-07T20:33:37.1732862Z T=1, 2025-05-07T20:33:37.1732938Z D=7168, 2025-05-07T20:33:37.1733019Z scale_ub=None, 2025-05-07T20:33:37.1733098Z contiguous=True, 2025-05-07T20:33:37.1733184Z compiled=False, 2025-05-07T20:33:37.1733253Z ) 2025-05-07T20:33:37.1733465Z self = 2025-05-07T20:33:37.1733626Z T = 1, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:37.1733631Z 2025-05-07T20:33:37.1733705Z @given( 2025-05-07T20:33:37.1733862Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1733959Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1734071Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1734189Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1734297Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1734368Z ) 2025-05-07T20:33:37.1734606Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1734693Z def test_silu_mul_quant( 2025-05-07T20:33:37.1734771Z self, 2025-05-07T20:33:37.1734849Z T: int, 2025-05-07T20:33:37.1734923Z D: int, 2025-05-07T20:33:37.1735015Z scale_ub: Optional[float], 2025-05-07T20:33:37.1735104Z contiguous: bool, 2025-05-07T20:33:37.1735184Z compiled: bool, 2025-05-07T20:33:37.1735257Z ) -> None: 2025-05-07T20:33:37.1735351Z torch.manual_seed(2025) 2025-05-07T20:33:37.1735421Z 2025-05-07T20:33:37.1735591Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1735661Z 2025-05-07T20:33:37.1735748Z x_sign = torch.sign(x) 2025-05-07T20:33:37.1735876Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.1735965Z x = x_sign * x_clamp 2025-05-07T20:33:37.1736043Z x0 = x[:, :D] 2025-05-07T20:33:37.1736123Z x1 = x[:, D:] 2025-05-07T20:33:37.1736191Z 2025-05-07T20:33:37.1736270Z if contiguous: 2025-05-07T20:33:37.1736366Z x0 = x0.contiguous() 2025-05-07T20:33:37.1736451Z x1 = x1.contiguous() 2025-05-07T20:33:37.1736521Z 2025-05-07T20:33:37.1736608Z if scale_ub is not None: 2025-05-07T20:33:37.1736709Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.1736838Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.1736912Z ) 2025-05-07T20:33:37.1736984Z else: 2025-05-07T20:33:37.1737079Z scale_ub_tensor = None 2025-05-07T20:33:37.1737150Z 2025-05-07T20:33:37.1737275Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.1737411Z op = silu_mul_quant 2025-05-07T20:33:37.1737492Z if compiled: 2025-05-07T20:33:37.1737589Z op = torch.compile(op) 2025-05-07T20:33:37.1737693Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1737761Z 2025-05-07T20:33:37.1737847Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.1737852Z 2025-05-07T20:33:37.1737986Z moe/activation_test.py:117: 2025-05-07T20:33:37.1738110Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1738208Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.1738302Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1738794Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.1738893Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.1739247Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.1739507Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.1739849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.1739940Z kernel = self.compile( 2025-05-07T20:33:37.1740577Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.1740757Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.1740878Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1740883Z 2025-05-07T20:33:37.1741081Z self = 2025-05-07T20:33:37.1741846Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.1742438Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8819f10cc0>} 2025-05-07T20:33:37.1743170Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.1743357Z context = 2025-05-07T20:33:37.1743366Z 2025-05-07T20:33:37.1743521Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.1743777Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.1743889Z module_map=module_map) 2025-05-07T20:33:37.1744045Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.1744140Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.1744217Z E ^ 2025-05-07T20:33:37.1744566Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.1744570Z 2025-05-07T20:33:37.1744987Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.1744994Z 2025-05-07T20:33:37.1745089Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1745304Z self=, 2025-05-07T20:33:37.1745381Z T=16384, 2025-05-07T20:33:37.1745455Z D=7168, 2025-05-07T20:33:37.1745533Z scale_ub=1200.0, 2025-05-07T20:33:37.1745616Z contiguous=False, 2025-05-07T20:33:37.1745696Z compiled=True, 2025-05-07T20:33:37.1745764Z ) 2025-05-07T20:33:37.1745977Z self = 2025-05-07T20:33:37.1746229Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = False, compiled = True 2025-05-07T20:33:37.1746234Z 2025-05-07T20:33:37.1746313Z @given( 2025-05-07T20:33:37.1746428Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1746521Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1746634Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1746805Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1746914Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1746988Z ) 2025-05-07T20:33:37.1747223Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1747314Z def test_silu_mul_quant( 2025-05-07T20:33:37.1747388Z self, 2025-05-07T20:33:37.1747513Z T: int, 2025-05-07T20:33:37.1747593Z D: int, 2025-05-07T20:33:37.1747686Z scale_ub: Optional[float], 2025-05-07T20:33:37.1747772Z contiguous: bool, 2025-05-07T20:33:37.1747858Z compiled: bool, 2025-05-07T20:33:37.1747934Z ) -> None: 2025-05-07T20:33:37.1748086Z torch.manual_seed(2025) 2025-05-07T20:33:37.1748157Z 2025-05-07T20:33:37.1748320Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1748390Z 2025-05-07T20:33:37.1748478Z x_sign = torch.sign(x) 2025-05-07T20:33:37.1748599Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.1748681Z x = x_sign * x_clamp 2025-05-07T20:33:37.1748758Z x0 = x[:, :D] 2025-05-07T20:33:37.1748831Z x1 = x[:, D:] 2025-05-07T20:33:37.1748902Z 2025-05-07T20:33:37.1748978Z if contiguous: 2025-05-07T20:33:37.1749064Z x0 = x0.contiguous() 2025-05-07T20:33:37.1749149Z x1 = x1.contiguous() 2025-05-07T20:33:37.1749260Z 2025-05-07T20:33:37.1749344Z if scale_ub is not None: 2025-05-07T20:33:37.1749448Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.1749580Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.1749652Z ) 2025-05-07T20:33:37.1749730Z else: 2025-05-07T20:33:37.1749820Z scale_ub_tensor = None 2025-05-07T20:33:37.1749888Z 2025-05-07T20:33:37.1750015Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.1750100Z op = silu_mul_quant 2025-05-07T20:33:37.1750187Z if compiled: 2025-05-07T20:33:37.1750300Z op = torch.compile(op) 2025-05-07T20:33:37.1750408Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1750498Z 2025-05-07T20:33:37.1750588Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.1750592Z 2025-05-07T20:33:37.1750683Z moe/activation_test.py:117: 2025-05-07T20:33:37.1750808Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1750907Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.1751002Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1751373Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:37.1751462Z return fn(*args, **kwargs) 
2025-05-07T20:33:37.1751951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.1752042Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.1752396Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.1752617Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.1752951Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.1753038Z kernel = self.compile( 2025-05-07T20:33:37.1753421Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.1753634Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.1753761Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1753766Z 2025-05-07T20:33:37.1753962Z self = 2025-05-07T20:33:37.1754723Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.1755256Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8819f120c0>} 2025-05-07T20:33:37.1755990Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.1756218Z context = 2025-05-07T20:33:37.1756223Z 2025-05-07T20:33:37.1756381Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.1756637Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.1756742Z module_map=module_map) 2025-05-07T20:33:37.1756898Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.1756993Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.1757067Z E ^ 2025-05-07T20:33:37.1757412Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.1757417Z 2025-05-07T20:33:37.1757891Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.1757896Z 2025-05-07T20:33:37.1757996Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1758216Z self=, 2025-05-07T20:33:37.1758290Z T=1, 2025-05-07T20:33:37.1758363Z D=7168, 2025-05-07T20:33:37.1758440Z scale_ub=None, 2025-05-07T20:33:37.1758523Z contiguous=False, 2025-05-07T20:33:37.1758601Z compiled=False, 2025-05-07T20:33:37.1758674Z ) 2025-05-07T20:33:37.1758886Z self = 2025-05-07T20:33:37.1759047Z T = 1, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:37.1759055Z 2025-05-07T20:33:37.1759126Z @given( 2025-05-07T20:33:37.1759240Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1759337Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1759447Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1759556Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1759670Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1759738Z ) 2025-05-07T20:33:37.1759979Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1760072Z def test_silu_mul_quant( 2025-05-07T20:33:37.1760144Z self, 2025-05-07T20:33:37.1760217Z T: int, 2025-05-07T20:33:37.1760292Z D: int, 2025-05-07T20:33:37.1760386Z scale_ub: Optional[float], 2025-05-07T20:33:37.1760472Z contiguous: bool, 2025-05-07T20:33:37.1760551Z compiled: bool, 2025-05-07T20:33:37.1760626Z ) -> None: 2025-05-07T20:33:37.1760716Z torch.manual_seed(2025) 2025-05-07T20:33:37.1760785Z 2025-05-07T20:33:37.1760947Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1761019Z 2025-05-07T20:33:37.1761106Z x_sign = torch.sign(x) 2025-05-07T20:33:37.1761224Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.1761357Z x = x_sign * x_clamp 2025-05-07T20:33:37.1761431Z x0 = x[:, :D] 2025-05-07T20:33:37.1761506Z x1 = x[:, D:] 2025-05-07T20:33:37.1761574Z 2025-05-07T20:33:37.1761653Z if contiguous: 2025-05-07T20:33:37.1761744Z x0 = x0.contiguous() 2025-05-07T20:33:37.1761829Z x1 = x1.contiguous() 2025-05-07T20:33:37.1761898Z 2025-05-07T20:33:37.1762064Z if scale_ub is not None: 2025-05-07T20:33:37.1762163Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.1762292Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.1762366Z ) 2025-05-07T20:33:37.1762439Z else: 2025-05-07T20:33:37.1762530Z scale_ub_tensor = None 2025-05-07T20:33:37.1762604Z 2025-05-07T20:33:37.1762727Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.1762813Z op = silu_mul_quant 2025-05-07T20:33:37.1762896Z if compiled: 2025-05-07T20:33:37.1762991Z op = torch.compile(op) 2025-05-07T20:33:37.1763093Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1763202Z 2025-05-07T20:33:37.1763289Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.1763293Z 2025-05-07T20:33:37.1763394Z moe/activation_test.py:117: 2025-05-07T20:33:37.1763518Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1763614Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.1763715Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1764203Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.1764294Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.1764653Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.1764916Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.1765258Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.1765348Z kernel = self.compile( 2025-05-07T20:33:37.1765730Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.1765899Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.1766021Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1766026Z 2025-05-07T20:33:37.1766231Z self = 2025-05-07T20:33:37.1766991Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.1767490Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8819f12c00>} 2025-05-07T20:33:37.1768234Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.1768424Z context = 2025-05-07T20:33:37.1768428Z 2025-05-07T20:33:37.1768591Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.1768846Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.1768948Z module_map=module_map) 2025-05-07T20:33:37.1769109Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.1769205Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.1769278Z E ^ 2025-05-07T20:33:37.1769672Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.1769677Z 2025-05-07T20:33:37.1770109Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.1770113Z 2025-05-07T20:33:37.1770214Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1770471Z self=, 2025-05-07T20:33:37.1770552Z T=2048, 2025-05-07T20:33:37.1770627Z D=7168, 2025-05-07T20:33:37.1770706Z scale_ub=None, 2025-05-07T20:33:37.1770793Z contiguous=False, 2025-05-07T20:33:37.1770873Z compiled=True, 2025-05-07T20:33:37.1770944Z ) 2025-05-07T20:33:37.1771163Z self = 2025-05-07T20:33:37.1771337Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:37.1771342Z 2025-05-07T20:33:37.1771416Z @given( 2025-05-07T20:33:37.1771537Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1771673Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1771789Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1771902Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1772012Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1772088Z ) 2025-05-07T20:33:37.1772325Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1772414Z def test_silu_mul_quant( 2025-05-07T20:33:37.1772490Z self, 2025-05-07T20:33:37.1772563Z T: int, 2025-05-07T20:33:37.1772638Z D: int, 2025-05-07T20:33:37.1772733Z scale_ub: Optional[float], 2025-05-07T20:33:37.1772818Z contiguous: bool, 2025-05-07T20:33:37.1772941Z compiled: bool, 2025-05-07T20:33:37.1773022Z ) -> None: 2025-05-07T20:33:37.1773114Z torch.manual_seed(2025) 2025-05-07T20:33:37.1773194Z 2025-05-07T20:33:37.1773360Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1773433Z 2025-05-07T20:33:37.1773525Z x_sign = torch.sign(x) 2025-05-07T20:33:37.1773645Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.1773733Z x = x_sign * x_clamp 2025-05-07T20:33:37.1773816Z x0 = x[:, :D] 2025-05-07T20:33:37.1773891Z x1 = x[:, D:] 2025-05-07T20:33:37.1773966Z 2025-05-07T20:33:37.1774051Z if contiguous: 2025-05-07T20:33:37.1774138Z x0 = x0.contiguous() 2025-05-07T20:33:37.1774223Z x1 = x1.contiguous() 2025-05-07T20:33:37.1774293Z 2025-05-07T20:33:37.1774378Z if scale_ub is not None: 2025-05-07T20:33:37.1774479Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.1774614Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.1774687Z ) 2025-05-07T20:33:37.1774767Z else: 2025-05-07T20:33:37.1774859Z scale_ub_tensor = None 2025-05-07T20:33:37.1774928Z 2025-05-07T20:33:37.1775062Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.1775149Z op = silu_mul_quant 2025-05-07T20:33:37.1775229Z if compiled: 2025-05-07T20:33:37.1775330Z op = torch.compile(op) 2025-05-07T20:33:37.1775434Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1775502Z 2025-05-07T20:33:37.1775592Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.1775596Z 2025-05-07T20:33:37.1775689Z moe/activation_test.py:117: 2025-05-07T20:33:37.1775816Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1775910Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.1776004Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1776373Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:37.1776509Z return fn(*args, **kwargs) 
2025-05-07T20:33:37.1776998Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.1777101Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.1777455Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.1777720Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.1778057Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.1778147Z kernel = self.compile( 2025-05-07T20:33:37.1778532Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.1778706Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.1778832Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1778841Z 2025-05-07T20:33:37.1779078Z self = 2025-05-07T20:33:37.1779843Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.1780374Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca1842c0>} 2025-05-07T20:33:37.1781128Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.1781359Z context = 2025-05-07T20:33:37.1781364Z 2025-05-07T20:33:37.1781526Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.1781786Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.1781899Z module_map=module_map) 2025-05-07T20:33:37.1782056Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.1782153Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.1782235Z E ^ 2025-05-07T20:33:37.1782583Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.1782588Z 2025-05-07T20:33:37.1783003Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.1783010Z 2025-05-07T20:33:37.1783109Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1783329Z self=, 2025-05-07T20:33:37.1783411Z T=4096, 2025-05-07T20:33:37.1783487Z D=7168, 2025-05-07T20:33:37.1783572Z scale_ub=None, 2025-05-07T20:33:37.1783660Z contiguous=False, 2025-05-07T20:33:37.1783738Z compiled=True, 2025-05-07T20:33:37.1783811Z ) 2025-05-07T20:33:37.1784025Z self = 2025-05-07T20:33:37.1784197Z T = 4096, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:37.1784201Z 2025-05-07T20:33:37.1784278Z @given( 2025-05-07T20:33:37.1784391Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1784483Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1784594Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1784704Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1784818Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1784890Z ) 2025-05-07T20:33:37.1785171Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1785264Z def test_silu_mul_quant( 2025-05-07T20:33:37.1785336Z self, 2025-05-07T20:33:37.1785410Z T: int, 2025-05-07T20:33:37.1785485Z D: int, 2025-05-07T20:33:37.1785579Z scale_ub: Optional[float], 2025-05-07T20:33:37.1785664Z contiguous: bool, 2025-05-07T20:33:37.1785795Z compiled: bool, 2025-05-07T20:33:37.1785871Z ) -> None: 2025-05-07T20:33:37.1785963Z torch.manual_seed(2025) 2025-05-07T20:33:37.1786040Z 2025-05-07T20:33:37.1786207Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1786281Z 2025-05-07T20:33:37.1786370Z x_sign = torch.sign(x) 2025-05-07T20:33:37.1786491Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.1786583Z x = x_sign * x_clamp 2025-05-07T20:33:37.1786660Z x0 = x[:, :D] 2025-05-07T20:33:37.1786736Z x1 = x[:, D:] 2025-05-07T20:33:37.1786810Z 2025-05-07T20:33:37.1786887Z if contiguous: 2025-05-07T20:33:37.1787011Z x0 = x0.contiguous() 2025-05-07T20:33:37.1787100Z x1 = x1.contiguous() 2025-05-07T20:33:37.1787167Z 2025-05-07T20:33:37.1787250Z if scale_ub is not None: 2025-05-07T20:33:37.1787352Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.1787537Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.1787610Z ) 2025-05-07T20:33:37.1787683Z else: 2025-05-07T20:33:37.1787772Z scale_ub_tensor = None 2025-05-07T20:33:37.1787843Z 2025-05-07T20:33:37.1787966Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.1788051Z op = silu_mul_quant 2025-05-07T20:33:37.1788132Z if compiled: 2025-05-07T20:33:37.1788272Z op = torch.compile(op) 2025-05-07T20:33:37.1788375Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1788442Z 2025-05-07T20:33:37.1788529Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.1788534Z 2025-05-07T20:33:37.1788626Z moe/activation_test.py:117: 2025-05-07T20:33:37.1788751Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1788845Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.1788940Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1789304Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:37.1789392Z return fn(*args, **kwargs) 
2025-05-07T20:33:37.1789880Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.1789972Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.1790328Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.1790554Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.1790896Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.1790991Z kernel = self.compile( 2025-05-07T20:33:37.1791367Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.1791539Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.1791663Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1791668Z 2025-05-07T20:33:37.1791865Z self = 2025-05-07T20:33:37.1792628Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.1793192Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca184d60>} 2025-05-07T20:33:37.1793928Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.1794155Z context = 2025-05-07T20:33:37.1794159Z 2025-05-07T20:33:37.1794319Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.1794578Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.1794684Z module_map=module_map) 2025-05-07T20:33:37.1794845Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.1794944Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.1795017Z E ^ 2025-05-07T20:33:37.1795410Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.1795418Z 2025-05-07T20:33:37.1795849Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.1795856Z 2025-05-07T20:33:37.1795954Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1796173Z self=, 2025-05-07T20:33:37.1796246Z T=16384, 2025-05-07T20:33:37.1796318Z D=5120, 2025-05-07T20:33:37.1796401Z scale_ub=1200.0, 2025-05-07T20:33:37.1796484Z contiguous=False, 2025-05-07T20:33:37.1796564Z compiled=False, 2025-05-07T20:33:37.1796637Z ) 2025-05-07T20:33:37.1796893Z self = 2025-05-07T20:33:37.1797071Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:37.1797078Z 2025-05-07T20:33:37.1797150Z @given( 2025-05-07T20:33:37.1797266Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1797363Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1797474Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1797585Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1797703Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1797772Z ) 2025-05-07T20:33:37.1798017Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1798103Z def test_silu_mul_quant( 2025-05-07T20:33:37.1798178Z self, 2025-05-07T20:33:37.1798257Z T: int, 2025-05-07T20:33:37.1798330Z D: int, 2025-05-07T20:33:37.1798425Z scale_ub: Optional[float], 2025-05-07T20:33:37.1798510Z contiguous: bool, 2025-05-07T20:33:37.1798589Z compiled: bool, 2025-05-07T20:33:37.1798661Z ) -> None: 2025-05-07T20:33:37.1798755Z torch.manual_seed(2025) 2025-05-07T20:33:37.1798830Z 2025-05-07T20:33:37.1798996Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1799072Z 2025-05-07T20:33:37.1799158Z x_sign = torch.sign(x) 2025-05-07T20:33:37.1799282Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.1799365Z x = x_sign * x_clamp 2025-05-07T20:33:37.1799440Z x0 = x[:, :D] 2025-05-07T20:33:37.1799516Z x1 = x[:, D:] 2025-05-07T20:33:37.1799582Z 2025-05-07T20:33:37.1799659Z if contiguous: 2025-05-07T20:33:37.1799748Z x0 = x0.contiguous() 2025-05-07T20:33:37.1799833Z x1 = x1.contiguous() 2025-05-07T20:33:37.1799904Z 2025-05-07T20:33:37.1799994Z if scale_ub is not None: 2025-05-07T20:33:37.1800097Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.1800224Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.1800350Z ) 2025-05-07T20:33:37.1800424Z else: 2025-05-07T20:33:37.1800516Z scale_ub_tensor = None 2025-05-07T20:33:37.1800589Z 2025-05-07T20:33:37.1800712Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.1800799Z op = silu_mul_quant 2025-05-07T20:33:37.1800880Z if compiled: 2025-05-07T20:33:37.1801018Z op = torch.compile(op) 2025-05-07T20:33:37.1801121Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1801189Z 2025-05-07T20:33:37.1801274Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.1801279Z 2025-05-07T20:33:37.1801372Z moe/activation_test.py:117: 2025-05-07T20:33:37.1801497Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1801592Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.1801691Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1802183Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 
2025-05-07T20:33:37.1802315Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.1802672Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.1802887Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.1803229Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.1803319Z kernel = self.compile( 2025-05-07T20:33:37.1803714Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.1803884Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.1804046Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1804050Z 2025-05-07T20:33:37.1804253Z self = 2025-05-07T20:33:37.1805017Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.1805512Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca185c60>} 2025-05-07T20:33:37.1806248Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.1806430Z context = 2025-05-07T20:33:37.1806437Z 2025-05-07T20:33:37.1806601Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.1806857Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.1806970Z module_map=module_map) 2025-05-07T20:33:37.1807127Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.1807222Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.1807298Z E ^ 2025-05-07T20:33:37.1807645Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:37.1807650Z 
2025-05-07T20:33:37.1808079Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
2025-05-07T20:33:37.1808086Z 
2025-05-07T20:33:37.1808182Z Trying example: test_silu_mul_quant(
2025-05-07T20:33:37.1808400Z     self=,
2025-05-07T20:33:37.1808481Z     T=16384,
2025-05-07T20:33:37.1808554Z     D=5120,
2025-05-07T20:33:37.1808679Z     scale_ub=1200.0,
2025-05-07T20:33:37.1808768Z     contiguous=True,
2025-05-07T20:33:37.1808855Z     compiled=True,
2025-05-07T20:33:37.1808921Z )
2025-05-07T20:33:37.1809137Z self = 
2025-05-07T20:33:37.1809310Z T = 16384, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True
2025-05-07T20:33:37.1809353Z 
2025-05-07T20:33:37.1809429Z     @given(
2025-05-07T20:33:37.1809543Z         T=st.sampled_from([1, 128, 2048, 4096, 16384]),
2025-05-07T20:33:37.1809640Z         D=st.sampled_from([5120, 7168]),
2025-05-07T20:33:37.1809756Z         scale_ub=st.sampled_from([None, 1200.00]),
2025-05-07T20:33:37.1809866Z         contiguous=st.sampled_from([True, False]),
2025-05-07T20:33:37.1809974Z         compiled=st.sampled_from([True, False]),
2025-05-07T20:33:37.1810051Z     )
2025-05-07T20:33:37.1810298Z     @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None)
2025-05-07T20:33:37.1810401Z     def test_silu_mul_quant(
2025-05-07T20:33:37.1810490Z         self,
2025-05-07T20:33:37.1810575Z         T: int,
2025-05-07T20:33:37.1810701Z         D: int,
2025-05-07T20:33:37.1810797Z         scale_ub: Optional[float],
2025-05-07T20:33:37.1810881Z         contiguous: bool,
2025-05-07T20:33:37.1810963Z         compiled: bool,
2025-05-07T20:33:37.1811037Z     ) -> None:
2025-05-07T20:33:37.1811130Z         torch.manual_seed(2025)
2025-05-07T20:33:37.1811202Z 
2025-05-07T20:33:37.1811363Z         x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16)
2025-05-07T20:33:37.1811432Z 
2025-05-07T20:33:37.1811522Z         x_sign = torch.sign(x)
2025-05-07T20:33:37.1811641Z         x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0)
2025-05-07T20:33:37.1811724Z         x = x_sign * x_clamp
2025-05-07T20:33:37.1811802Z         x0 = x[:, :D]
2025-05-07T20:33:37.1811917Z         x1 = x[:, D:]
2025-05-07T20:33:37.1811985Z 
2025-05-07T20:33:37.1812068Z         if contiguous:
2025-05-07T20:33:37.1812153Z             x0 = x0.contiguous()
2025-05-07T20:33:37.1812240Z             x1 = x1.contiguous()
2025-05-07T20:33:37.1812313Z 
2025-05-07T20:33:37.1812399Z         if scale_ub is not None:
2025-05-07T20:33:37.1812507Z             scale_ub_tensor = torch.tensor(
2025-05-07T20:33:37.1812637Z                 [scale_ub], device="cuda", dtype=torch.float32
2025-05-07T20:33:37.1812711Z             )
2025-05-07T20:33:37.1812793Z         else:
2025-05-07T20:33:37.1812883Z             scale_ub_tensor = None
2025-05-07T20:33:37.1812953Z 
2025-05-07T20:33:37.1813086Z         def fn() -> Tuple[torch.Tensor, torch.Tensor]:
2025-05-07T20:33:37.1813174Z             op = silu_mul_quant
2025-05-07T20:33:37.1813258Z             if compiled:
2025-05-07T20:33:37.1813359Z                 op = torch.compile(op)
2025-05-07T20:33:37.1813462Z             return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:37.1813542Z 
2025-05-07T20:33:37.1813632Z >       y_fp8, y_scale = fn()
2025-05-07T20:33:37.1813636Z 
2025-05-07T20:33:37.1813734Z moe/activation_test.py:117: 
2025-05-07T20:33:37.1813867Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:37.1813966Z moe/activation_test.py:115: in fn
2025-05-07T20:33:37.1814062Z     return op(x0, x1, scale_ub_tensor)
2025-05-07T20:33:37.1814425Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn
2025-05-07T20:33:37.1814517Z     return fn(*args, **kwargs)
2025-05-07T20:33:37.1815001Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant
2025-05-07T20:33:37.1815094Z     _fbgemm_silu_mul_quant[grid](
2025-05-07T20:33:37.1815446Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in <lambda>
2025-05-07T20:33:37.1815669Z     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
2025-05-07T20:33:37.1816055Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run
2025-05-07T20:33:37.1816147Z     kernel = self.compile(
2025-05-07T20:33:37.1816547Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile
2025-05-07T20:33:37.1816714Z     module = src.make_ir(options, codegen_fns, module_map, context)
2025-05-07T20:33:37.1816880Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2025-05-07T20:33:37.1816884Z 
2025-05-07T20:33:37.1817081Z self = 
2025-05-07T20:33:37.1817846Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True)
2025-05-07T20:33:37.1818355Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f89ca187380>}
2025-05-07T20:33:37.1819124Z module_map = {'triton.language.extra.libdevice': }
2025-05-07T20:33:37.1819316Z context = 
2025-05-07T20:33:37.1819323Z 
2025-05-07T20:33:37.1819482Z     def make_ir(self, options, codegen_fns, module_map, context):
2025-05-07T20:33:37.1819737Z >       return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
2025-05-07T20:33:37.1819844Z                            module_map=module_map)
2025-05-07T20:33:37.1819999Z E       triton.compiler.errors.CompilationError: at 1:0:
2025-05-07T20:33:37.1820159Z E       def _fbgemm_silu_mul_quant(
2025-05-07T20:33:37.1820237Z E       ^
2025-05-07T20:33:37.1820589Z E       ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')")
2025-05-07T20:33:37.1820593Z 
2025-05-07T20:33:37.1821038Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError
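
Why these examples fail: the _fbgemm_silu_mul_quant kernel asks Triton for the fp8e4nv element type (NVIDIA's FP8 E4M3 encoding), and Triton can only lower fp8e4nv on GPUs with compute capability 8.9 or newer (Ada/Hopper class). The ~22 GiB device reported in the out-of-memory message at the end of this run is consistent with an A10G, which is compute capability 8.6, so only the fp8e4b15 and fp8e5 encodings named in the error are available there. A minimal sketch of a capability guard that would skip these cases on such a runner (the helper name and skip message are illustrative, not FBGEMM's actual code):

    import unittest

    import torch

    def _supports_fp8_e4m3() -> bool:
        # Triton's fp8e4nv (E4M3) codegen requires SM 8.9+ on CUDA devices.
        return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

    @unittest.skipUnless(_supports_fp8_e4m3(), "fp8e4nv requires compute capability >= 8.9")
    class ActivationTests(unittest.TestCase):
        ...
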
Hypothesis went on to try eleven more examples, and every one failed with the identical CompilationError out of the _fbgemm_silu_mul_quant compile (triton/compiler/compiler.py:100: "type fp8e4nv not supported in this architecture. The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')"), raised through the same call chain from moe/activation_test.py:117 (for the compiled=False examples, without the torch/_dynamo/eval_frame.py frame). Only the drawn parameters differ:

    T=16384, D=5120, scale_ub=None,   contiguous=False, compiled=True
    T=2048,  D=5120, scale_ub=None,   contiguous=False, compiled=True
    T=2048,  D=5120, scale_ub=1200.0, contiguous=False, compiled=True
    T=4096,  D=5120, scale_ub=1200.0, contiguous=True,  compiled=True
    T=128,   D=5120, scale_ub=1200.0, contiguous=False, compiled=True
    T=16384, D=7168, scale_ub=1200.0, contiguous=True,  compiled=True
    T=16384, D=5120, scale_ub=1200.0, contiguous=True,  compiled=False
    T=1,     D=7168, scale_ub=1200.0, contiguous=False, compiled=False
    T=4096,  D=7168, scale_ub=1200.0, contiguous=False, compiled=True
    T=128,   D=7168, scale_ub=1200.0, contiguous=False, compiled=True
    T=2048,  D=7168, scale_ub=None,   contiguous=True,  compiled=True
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.1967346Z 2025-05-07T20:33:37.1967773Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.1967777Z 2025-05-07T20:33:37.1967875Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1968134Z self=, 2025-05-07T20:33:37.1968211Z T=16384, 2025-05-07T20:33:37.1968287Z D=5120, 2025-05-07T20:33:37.1968367Z scale_ub=None, 2025-05-07T20:33:37.1968452Z contiguous=False, 2025-05-07T20:33:37.1968533Z compiled=False, 2025-05-07T20:33:37.1968602Z ) 2025-05-07T20:33:37.1968816Z self = 2025-05-07T20:33:37.1968988Z T = 16384, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:37.1968996Z 2025-05-07T20:33:37.1969064Z @given( 2025-05-07T20:33:37.1969179Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1969269Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1969377Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1969490Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1969595Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1969667Z ) 2025-05-07T20:33:37.1969904Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1969993Z def test_silu_mul_quant( 2025-05-07T20:33:37.1970067Z self, 2025-05-07T20:33:37.1970143Z T: int, 2025-05-07T20:33:37.1970213Z D: int, 2025-05-07T20:33:37.1970307Z scale_ub: Optional[float], 2025-05-07T20:33:37.1970412Z contiguous: bool, 2025-05-07T20:33:37.1970495Z compiled: bool, 2025-05-07T20:33:37.1970587Z ) -> None: 2025-05-07T20:33:37.1970682Z torch.manual_seed(2025) 2025-05-07T20:33:37.1970749Z 2025-05-07T20:33:37.1970912Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1970978Z 2025-05-07T20:33:37.1971066Z x_sign = torch.sign(x) 2025-05-07T20:33:37.1971185Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.1973006Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.60 GiB is allocated by PyTorch, and 45.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
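The CompilationError repeated throughout this log is an architecture limit rather than a kernel bug: Triton's fp8e4nv dtype (PyTorch's float8_e4m3fn) is only supported on compute capability 8.9 and newer (Ada, Hopper), while the A10G in a g5.4xlarge is sm_86, where Triton offers only fp8e4b15 and fp8e5. A minimal sketch of a capability gate that would skip these examples on older GPUs; the helper name and skip wiring are illustrative, not FBGEMM's actual code:

# Hypothetical guard, not part of activation_test.py: skip fp8e4nv
# (float8_e4m3fn) kernels on GPUs older than sm_89.
import unittest

import torch

def supports_fp8e4nv() -> bool:
    # Triton's fp8e4nv needs compute capability >= (8, 9).
    if not torch.cuda.is_available():
        return False
    return torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv needs sm_89+; A10G is sm_86")
class Fp8ActivationTests(unittest.TestCase):
    ...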
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:37.1973020Z 2025-05-07T20:33:37.1973171Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:37.1973176Z 2025-05-07T20:33:37.1973269Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1973489Z self=, 2025-05-07T20:33:37.1973559Z T=4096, 2025-05-07T20:33:37.1973625Z D=7168, 2025-05-07T20:33:37.1973708Z scale_ub=1200.0, 2025-05-07T20:33:37.1973784Z contiguous=True, 2025-05-07T20:33:37.1973862Z compiled=True, 2025-05-07T20:33:37.1973932Z ) 2025-05-07T20:33:37.1974141Z self = 2025-05-07T20:33:37.1974316Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:37.1974383Z 2025-05-07T20:33:37.1974458Z @given( 2025-05-07T20:33:37.1974568Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1974660Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1974765Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1974876Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1974984Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1975055Z ) 2025-05-07T20:33:37.1975297Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1975383Z def test_silu_mul_quant( 2025-05-07T20:33:37.1975454Z self, 2025-05-07T20:33:37.1975532Z T: int, 2025-05-07T20:33:37.1975647Z D: int, 2025-05-07T20:33:37.1975738Z scale_ub: Optional[float], 2025-05-07T20:33:37.1975825Z contiguous: bool, 2025-05-07T20:33:37.1975911Z compiled: bool, 2025-05-07T20:33:37.1975983Z ) -> None: 2025-05-07T20:33:37.1976075Z torch.manual_seed(2025) 2025-05-07T20:33:37.1976143Z 2025-05-07T20:33:37.1976302Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1976373Z 2025-05-07T20:33:37.1976457Z x_sign = torch.sign(x) 2025-05-07T20:33:37.1976579Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.1978335Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.61 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
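The allocation sizes in these OutOfMemoryError messages line up exactly with the test's [T, 2*D] bfloat16 temporaries at 2 bytes per element: T=4096, D=7168 gives 4096 * 14336 * 2 B = 112 MiB, and T=16384 gives 448 MiB (D=7168) or 320 MiB (D=5120). A quick check of that arithmetic:

# Size of one [T, 2*D] bfloat16 temporary, in MiB.
def bf16_mib(T: int, D: int) -> float:
    return T * 2 * D * 2 / 2**20

print(bf16_mib(4096, 7168))   # 112.0 -> "Tried to allocate 112.00 MiB"
print(bf16_mib(16384, 7168))  # 448.0 -> "Tried to allocate 448.00 MiB"
print(bf16_mib(16384, 5120))  # 320.0 -> "Tried to allocate 320.00 MiB"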
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:37.1978344Z 2025-05-07T20:33:37.1978459Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:37.1978463Z 2025-05-07T20:33:37.1978557Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1978769Z self=, 2025-05-07T20:33:37.1978846Z T=16384, 2025-05-07T20:33:37.1978919Z D=7168, 2025-05-07T20:33:37.1978994Z scale_ub=None, 2025-05-07T20:33:37.1979074Z contiguous=False, 2025-05-07T20:33:37.1979152Z compiled=False, 2025-05-07T20:33:37.1979216Z ) 2025-05-07T20:33:37.1979425Z self = 2025-05-07T20:33:37.1979591Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:37.1979598Z 2025-05-07T20:33:37.1979673Z @given( 2025-05-07T20:33:37.1979782Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1979919Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1980030Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1980142Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1980246Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1980318Z ) 2025-05-07T20:33:37.1980553Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1980678Z def test_silu_mul_quant( 2025-05-07T20:33:37.1980753Z self, 2025-05-07T20:33:37.1980825Z T: int, 2025-05-07T20:33:37.1980896Z D: int, 2025-05-07T20:33:37.1980989Z scale_ub: Optional[float], 2025-05-07T20:33:37.1981071Z contiguous: bool, 2025-05-07T20:33:37.1981152Z compiled: bool, 2025-05-07T20:33:37.1981223Z ) -> None: 2025-05-07T20:33:37.1981311Z torch.manual_seed(2025) 2025-05-07T20:33:37.1981377Z 2025-05-07T20:33:37.1981536Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1983330Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 140.44 MiB is free. Including non-PyTorch memory, this process has 21.92 GiB memory in use. Of the allocated memory 21.50 GiB is allocated by PyTorch, and 141.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
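The allocator hint in the message, PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True, only helps when fragmentation rather than total footprint is the problem, and it has to be in the environment before the process makes its first CUDA allocation. A sketch of applying it from Python; exporting it in the CI job environment would be equivalent:

# Must be set before the first CUDA allocation in the process, or the
# caching allocator will already have been configured without it.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch  # safe here: the allocator reads the variable at first CUDA use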
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:37.1983342Z 2025-05-07T20:33:37.1983453Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:37.1983457Z 2025-05-07T20:33:37.1983551Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1983770Z self=, 2025-05-07T20:33:37.1983878Z T=2048, 2025-05-07T20:33:37.1983946Z D=7168, 2025-05-07T20:33:37.1984025Z scale_ub=1200.0, 2025-05-07T20:33:37.1984105Z contiguous=True, 2025-05-07T20:33:37.1984185Z compiled=True, 2025-05-07T20:33:37.1984252Z ) 2025-05-07T20:33:37.1984460Z self = 2025-05-07T20:33:37.1984625Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:37.1984629Z 2025-05-07T20:33:37.1984701Z @given( 2025-05-07T20:33:37.1984809Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1984902Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1985007Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1985114Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1985221Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1985290Z ) 2025-05-07T20:33:37.1985527Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1985612Z def test_silu_mul_quant( 2025-05-07T20:33:37.1985684Z self, 2025-05-07T20:33:37.1985754Z T: int, 2025-05-07T20:33:37.1985826Z D: int, 2025-05-07T20:33:37.1985916Z scale_ub: Optional[float], 2025-05-07T20:33:37.1986001Z contiguous: bool, 2025-05-07T20:33:37.1986083Z compiled: bool, 2025-05-07T20:33:37.1986156Z ) -> None: 2025-05-07T20:33:37.1986245Z torch.manual_seed(2025) 2025-05-07T20:33:37.1986313Z 2025-05-07T20:33:37.1986471Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1986541Z 2025-05-07T20:33:37.1986624Z x_sign = torch.sign(x) 2025-05-07T20:33:37.1986743Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.1988580Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
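Because Hypothesis replays many examples in one long-lived process, memory cached by earlier examples (the messages show 21.6 GiB and more already held by PyTorch) starves later ones. One mitigation, not present in the test above, is to drop dead references and return cached blocks between examples:

# Hypothetical per-example cleanup; not part of activation_test.py.
import gc

import torch

def free_cuda_cache() -> None:
    gc.collect()                          # release dead tensor references
    torch.cuda.empty_cache()              # hand cached blocks back to the driver
    torch.cuda.reset_peak_memory_stats()  # keep peak stats per-example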
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:37.1988591Z 2025-05-07T20:33:37.1988706Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:37.1988749Z 2025-05-07T20:33:37.1988843Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1989054Z self=, 2025-05-07T20:33:37.1989124Z T=2048, 2025-05-07T20:33:37.1989192Z D=7168, 2025-05-07T20:33:37.1989266Z scale_ub=None, 2025-05-07T20:33:37.1989345Z contiguous=True, 2025-05-07T20:33:37.1989421Z compiled=False, 2025-05-07T20:33:37.1989487Z ) 2025-05-07T20:33:37.1989696Z self = 2025-05-07T20:33:37.1989862Z T = 2048, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:37.1989867Z 2025-05-07T20:33:37.1989977Z @given( 2025-05-07T20:33:37.1990088Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1990179Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1990287Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1990399Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1990503Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1990574Z ) 2025-05-07T20:33:37.1990806Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1990896Z def test_silu_mul_quant( 2025-05-07T20:33:37.1990966Z self, 2025-05-07T20:33:37.1991035Z T: int, 2025-05-07T20:33:37.1991106Z D: int, 2025-05-07T20:33:37.1991238Z scale_ub: Optional[float], 2025-05-07T20:33:37.1991320Z contiguous: bool, 2025-05-07T20:33:37.1991401Z compiled: bool, 2025-05-07T20:33:37.1991473Z ) -> None: 2025-05-07T20:33:37.1991562Z torch.manual_seed(2025) 2025-05-07T20:33:37.1991631Z 2025-05-07T20:33:37.1991788Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1991852Z 2025-05-07T20:33:37.1991940Z > x_sign = torch.sign(x) 2025-05-07T20:33:37.1993672Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 28.44 MiB is free. Including non-PyTorch memory, this process has 22.03 GiB memory in use. Of the allocated memory 21.67 GiB is allocated by PyTorch, and 85.02 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
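Stripped of the Hypothesis harness, the failing call reduces to a few lines. This repro assumes the module path shown in the traceback (fbgemm_gpu/experimental/gen_ai/moe/activation.py) is importable as written, and it uses the (x0, x1, scale_ub) call shape visible in the test source:

# Minimal repro sketch: on sm_86 this raises the CompilationError above;
# on sm_89+ it should return an fp8 tensor and its scale.
import torch
from fbgemm_gpu.experimental.gen_ai.moe.activation import silu_mul_quant

x0 = torch.randn(128, 5120, device="cuda", dtype=torch.bfloat16)
x1 = torch.randn(128, 5120, device="cuda", dtype=torch.bfloat16)
y_fp8, y_scale = silu_mul_quant(x0, x1, None)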
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:37.1993683Z 2025-05-07T20:33:37.1993797Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:37.1993802Z 2025-05-07T20:33:37.1993897Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.1994111Z self=, 2025-05-07T20:33:37.1994181Z T=1, 2025-05-07T20:33:37.1994249Z D=7168, 2025-05-07T20:33:37.1994326Z scale_ub=1200.0, 2025-05-07T20:33:37.1994404Z contiguous=True, 2025-05-07T20:33:37.1994477Z compiled=False, 2025-05-07T20:33:37.1994543Z ) 2025-05-07T20:33:37.1994751Z self = 2025-05-07T20:33:37.1994907Z T = 1, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:37.1994912Z 2025-05-07T20:33:37.1994983Z @given( 2025-05-07T20:33:37.1995090Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.1995187Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.1995292Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.1995444Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.1995555Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.1995624Z ) 2025-05-07T20:33:37.1995857Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.1995946Z def test_silu_mul_quant( 2025-05-07T20:33:37.1996018Z self, 2025-05-07T20:33:37.1996124Z T: int, 2025-05-07T20:33:37.1996196Z D: int, 2025-05-07T20:33:37.1996284Z scale_ub: Optional[float], 2025-05-07T20:33:37.1996365Z contiguous: bool, 2025-05-07T20:33:37.1996445Z compiled: bool, 2025-05-07T20:33:37.1996515Z ) -> None: 2025-05-07T20:33:37.1996605Z torch.manual_seed(2025) 2025-05-07T20:33:37.1996671Z 2025-05-07T20:33:37.1996827Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.1996898Z 2025-05-07T20:33:37.1996982Z x_sign = torch.sign(x) 2025-05-07T20:33:37.1997101Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.1997185Z x = x_sign * x_clamp 2025-05-07T20:33:37.1997296Z x0 = x[:, :D] 2025-05-07T20:33:37.1997369Z x1 = x[:, D:] 2025-05-07T20:33:37.1997439Z 2025-05-07T20:33:37.1997515Z if contiguous: 2025-05-07T20:33:37.1997599Z x0 = x0.contiguous() 2025-05-07T20:33:37.1997684Z x1 = x1.contiguous() 2025-05-07T20:33:37.1997753Z 2025-05-07T20:33:37.1997835Z if scale_ub is not None: 2025-05-07T20:33:37.1997936Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.1998062Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.1998131Z ) 2025-05-07T20:33:37.1998202Z else: 2025-05-07T20:33:37.1998289Z scale_ub_tensor = None 2025-05-07T20:33:37.1998396Z 2025-05-07T20:33:37.1998520Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.1998602Z op = silu_mul_quant 2025-05-07T20:33:37.1998684Z if compiled: 2025-05-07T20:33:37.1998776Z op = torch.compile(op) 2025-05-07T20:33:37.1998876Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1998946Z 2025-05-07T20:33:37.1999029Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.1999034Z 2025-05-07T20:33:37.1999125Z moe/activation_test.py:117: 2025-05-07T20:33:37.1999248Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.1999341Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.1999438Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.1999933Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.2000025Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.2000411Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.2000656Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.2000995Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.2001082Z kernel = self.compile( 2025-05-07T20:33:37.2001476Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.2001654Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.2001773Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.2001778Z 2025-05-07T20:33:37.2001970Z self = 2025-05-07T20:33:37.2002734Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.2003271Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f88199aa2a0>} 2025-05-07T20:33:37.2004007Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.2004251Z context = 2025-05-07T20:33:37.2004256Z 2025-05-07T20:33:37.2004414Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.2004669Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.2004775Z module_map=module_map) 2025-05-07T20:33:37.2004936Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.2005030Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.2005102Z E ^ 2025-05-07T20:33:37.2005493Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.2005498Z 2025-05-07T20:33:37.2005912Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.2005919Z 2025-05-07T20:33:37.2006019Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2006234Z self=, 2025-05-07T20:33:37.2006306Z T=128, 2025-05-07T20:33:37.2006385Z D=5120, 2025-05-07T20:33:37.2006462Z scale_ub=None, 2025-05-07T20:33:37.2006541Z contiguous=True, 2025-05-07T20:33:37.2006621Z compiled=False, 2025-05-07T20:33:37.2006688Z ) 2025-05-07T20:33:37.2006941Z self = 2025-05-07T20:33:37.2007106Z T = 128, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:37.2007114Z 2025-05-07T20:33:37.2007182Z @given( 2025-05-07T20:33:37.2007300Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2007395Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2007505Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2007618Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2007727Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2007796Z ) 2025-05-07T20:33:37.2008031Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2008121Z def test_silu_mul_quant( 2025-05-07T20:33:37.2008192Z self, 2025-05-07T20:33:37.2008263Z T: int, 2025-05-07T20:33:37.2008334Z D: int, 2025-05-07T20:33:37.2008434Z scale_ub: Optional[float], 2025-05-07T20:33:37.2008518Z contiguous: bool, 2025-05-07T20:33:37.2008600Z compiled: bool, 2025-05-07T20:33:37.2008678Z ) -> None: 2025-05-07T20:33:37.2008771Z torch.manual_seed(2025) 2025-05-07T20:33:37.2008841Z 2025-05-07T20:33:37.2009009Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2009080Z 2025-05-07T20:33:37.2009166Z x_sign = torch.sign(x) 2025-05-07T20:33:37.2009288Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.2009373Z x = x_sign * x_clamp 2025-05-07T20:33:37.2009451Z x0 = x[:, :D] 2025-05-07T20:33:37.2009527Z x1 = x[:, D:] 2025-05-07T20:33:37.2009595Z 2025-05-07T20:33:37.2009675Z if contiguous: 2025-05-07T20:33:37.2009759Z x0 = x0.contiguous() 2025-05-07T20:33:37.2009841Z x1 = x1.contiguous() 2025-05-07T20:33:37.2009911Z 2025-05-07T20:33:37.2009993Z if scale_ub is not None: 2025-05-07T20:33:37.2010095Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.2010225Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.2010343Z ) 2025-05-07T20:33:37.2010416Z else: 2025-05-07T20:33:37.2010514Z scale_ub_tensor = None 2025-05-07T20:33:37.2010579Z 2025-05-07T20:33:37.2010707Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.2010791Z op = silu_mul_quant 2025-05-07T20:33:37.2010873Z if compiled: 2025-05-07T20:33:37.2011015Z op = torch.compile(op) 2025-05-07T20:33:37.2011118Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.2011187Z 2025-05-07T20:33:37.2011276Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.2011281Z 2025-05-07T20:33:37.2011372Z moe/activation_test.py:117: 2025-05-07T20:33:37.2011495Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.2011593Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.2011688Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.2012186Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.2012316Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.2012669Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.2012888Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.2013228Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.2013316Z kernel = self.compile( 2025-05-07T20:33:37.2013713Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.2013881Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.2014044Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.2014049Z 2025-05-07T20:33:37.2014247Z self = 2025-05-07T20:33:37.2015009Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.2015501Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f88199ab1a0>} 2025-05-07T20:33:37.2016237Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.2016421Z context = 2025-05-07T20:33:37.2016432Z 2025-05-07T20:33:37.2016589Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.2016851Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.2016951Z module_map=module_map) 2025-05-07T20:33:37.2017107Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.2017201Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.2017270Z E ^ 2025-05-07T20:33:37.2017619Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.2017623Z 2025-05-07T20:33:37.2018042Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.2018047Z 2025-05-07T20:33:37.2018144Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2018369Z self=, 2025-05-07T20:33:37.2018446Z T=128, 2025-05-07T20:33:37.2018517Z D=7168, 2025-05-07T20:33:37.2018642Z scale_ub=None, 2025-05-07T20:33:37.2018725Z contiguous=True, 2025-05-07T20:33:37.2018809Z compiled=False, 2025-05-07T20:33:37.2018879Z ) 2025-05-07T20:33:37.2019090Z self = 2025-05-07T20:33:37.2019254Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:37.2019325Z 2025-05-07T20:33:37.2019399Z @given( 2025-05-07T20:33:37.2019511Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2019607Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2019716Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2019827Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2019942Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2020015Z ) 2025-05-07T20:33:37.2020251Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2020345Z def test_silu_mul_quant( 2025-05-07T20:33:37.2020419Z self, 2025-05-07T20:33:37.2020491Z T: int, 2025-05-07T20:33:37.2020632Z D: int, 2025-05-07T20:33:37.2020733Z scale_ub: Optional[float], 2025-05-07T20:33:37.2020838Z contiguous: bool, 2025-05-07T20:33:37.2020916Z compiled: bool, 2025-05-07T20:33:37.2020988Z ) -> None: 2025-05-07T20:33:37.2021086Z torch.manual_seed(2025) 2025-05-07T20:33:37.2021156Z 2025-05-07T20:33:37.2021316Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2021391Z 2025-05-07T20:33:37.2021476Z x_sign = torch.sign(x) 2025-05-07T20:33:37.2021593Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.2021677Z x = x_sign * x_clamp 2025-05-07T20:33:37.2021749Z x0 = x[:, :D] 2025-05-07T20:33:37.2021862Z x1 = x[:, D:] 2025-05-07T20:33:37.2021931Z 2025-05-07T20:33:37.2022006Z if contiguous: 2025-05-07T20:33:37.2022088Z x0 = x0.contiguous() 2025-05-07T20:33:37.2022175Z x1 = x1.contiguous() 2025-05-07T20:33:37.2022244Z 2025-05-07T20:33:37.2022332Z if scale_ub is not None: 2025-05-07T20:33:37.2022431Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.2022557Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.2022626Z ) 2025-05-07T20:33:37.2022702Z else: 2025-05-07T20:33:37.2022789Z scale_ub_tensor = None 2025-05-07T20:33:37.2022864Z 2025-05-07T20:33:37.2022987Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.2023071Z op = silu_mul_quant 2025-05-07T20:33:37.2023157Z if compiled: 2025-05-07T20:33:37.2023251Z op = torch.compile(op) 2025-05-07T20:33:37.2023349Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.2023426Z 2025-05-07T20:33:37.2023512Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.2023516Z 2025-05-07T20:33:37.2023614Z moe/activation_test.py:117: 2025-05-07T20:33:37.2023738Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.2023829Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.2023922Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.2024408Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.2024500Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.2024855Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.2025068Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.2025402Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.2025492Z kernel = self.compile( 2025-05-07T20:33:37.2025930Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.2026101Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.2026219Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.2026224Z 2025-05-07T20:33:37.2026421Z self = 2025-05-07T20:33:37.2027219Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.2027752Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8819b78040>} 2025-05-07T20:33:37.2028491Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.2028711Z context = 2025-05-07T20:33:37.2028716Z 2025-05-07T20:33:37.2028876Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.2029128Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.2029232Z module_map=module_map) 2025-05-07T20:33:37.2029389Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.2029478Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.2029552Z E ^ 2025-05-07T20:33:37.2029897Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.2029942Z 2025-05-07T20:33:37.2030355Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.2030360Z 2025-05-07T20:33:37.2030471Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2030724Z self=, 2025-05-07T20:33:37.2030798Z T=2048, 2025-05-07T20:33:37.2030873Z D=7168, 2025-05-07T20:33:37.2030950Z scale_ub=1200.0, 2025-05-07T20:33:37.2031035Z contiguous=True, 2025-05-07T20:33:37.2031115Z compiled=False, 2025-05-07T20:33:37.2031181Z ) 2025-05-07T20:33:37.2031395Z self = 2025-05-07T20:33:37.2031560Z T = 2048, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:37.2031564Z 2025-05-07T20:33:37.2031636Z @given( 2025-05-07T20:33:37.2031751Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2031847Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2031958Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2032071Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2032181Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2032250Z ) 2025-05-07T20:33:37.2032487Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2032573Z def test_silu_mul_quant( 2025-05-07T20:33:37.2032647Z self, 2025-05-07T20:33:37.2032721Z T: int, 2025-05-07T20:33:37.2032791Z D: int, 2025-05-07T20:33:37.2032884Z scale_ub: Optional[float], 2025-05-07T20:33:37.2032968Z contiguous: bool, 2025-05-07T20:33:37.2033046Z compiled: bool, 2025-05-07T20:33:37.2033120Z ) -> None: 2025-05-07T20:33:37.2033209Z torch.manual_seed(2025) 2025-05-07T20:33:37.2033277Z 2025-05-07T20:33:37.2033440Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2035264Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.69 GiB is allocated by PyTorch, and 59.18 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
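On a 22 GiB device the largest sampled shapes cannot coexist with what the process already holds, and Hypothesis can skip such draws instead of failing them. A sketch using hypothesis.assume with a hypothetical budget; the bound, the factor of six, and the test name are illustrative rather than taken from the suite:

from hypothesis import assume, given, strategies as st

_BUDGET_BYTES = 2 * 2**30  # assumed per-example budget: 2 GiB

@given(
    T=st.sampled_from([1, 128, 2048, 4096, 16384]),
    D=st.sampled_from([5120, 7168]),
)
def test_silu_mul_quant_bounded(T: int, D: int) -> None:
    # The original test keeps roughly six [T, 2*D]-sized bf16 buffers alive.
    assume(6 * T * 2 * D * 2 <= _BUDGET_BYTES)
    ...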
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:37.2035312Z 2025-05-07T20:33:37.2035423Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:37.2035428Z 2025-05-07T20:33:37.2035522Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2035740Z self=, 2025-05-07T20:33:37.2035810Z T=1, 2025-05-07T20:33:37.2035883Z D=5120, 2025-05-07T20:33:37.2035963Z scale_ub=1200.0, 2025-05-07T20:33:37.2036040Z contiguous=True, 2025-05-07T20:33:37.2036117Z compiled=False, 2025-05-07T20:33:37.2036192Z ) 2025-05-07T20:33:37.2036439Z self = 2025-05-07T20:33:37.2036602Z T = 1, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:37.2036610Z 2025-05-07T20:33:37.2036681Z @given( 2025-05-07T20:33:37.2036790Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2036888Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2036996Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2037107Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2037218Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2037286Z ) 2025-05-07T20:33:37.2037523Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2037654Z def test_silu_mul_quant( 2025-05-07T20:33:37.2037726Z self, 2025-05-07T20:33:37.2037793Z T: int, 2025-05-07T20:33:37.2037869Z D: int, 2025-05-07T20:33:37.2037963Z scale_ub: Optional[float], 2025-05-07T20:33:37.2038051Z contiguous: bool, 2025-05-07T20:33:37.2038129Z compiled: bool, 2025-05-07T20:33:37.2038200Z ) -> None: 2025-05-07T20:33:37.2038292Z torch.manual_seed(2025) 2025-05-07T20:33:37.2038359Z 2025-05-07T20:33:37.2038517Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2038590Z 2025-05-07T20:33:37.2038677Z x_sign = torch.sign(x) 2025-05-07T20:33:37.2038795Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.2038880Z x = x_sign * x_clamp 2025-05-07T20:33:37.2038955Z x0 = x[:, :D] 2025-05-07T20:33:37.2039029Z x1 = x[:, D:] 2025-05-07T20:33:37.2039098Z 2025-05-07T20:33:37.2039175Z if contiguous: 2025-05-07T20:33:37.2039265Z x0 = x0.contiguous() 2025-05-07T20:33:37.2039347Z x1 = x1.contiguous() 2025-05-07T20:33:37.2039412Z 2025-05-07T20:33:37.2039499Z if scale_ub is not None: 2025-05-07T20:33:37.2039601Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.2039728Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.2039802Z ) 2025-05-07T20:33:37.2039871Z else: 2025-05-07T20:33:37.2039961Z scale_ub_tensor = None 2025-05-07T20:33:37.2040029Z 2025-05-07T20:33:37.2040536Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.2040668Z op = silu_mul_quant 2025-05-07T20:33:37.2040751Z if compiled: 2025-05-07T20:33:37.2040844Z op = torch.compile(op) 2025-05-07T20:33:37.2040945Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.2041012Z 2025-05-07T20:33:37.2041098Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.2041107Z 2025-05-07T20:33:37.2041199Z moe/activation_test.py:117: 2025-05-07T20:33:37.2041320Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.2041502Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.2041599Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.2042085Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.2042177Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.2042590Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.2042806Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.2043148Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.2043237Z kernel = self.compile( 2025-05-07T20:33:37.2043633Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.2043816Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.2043992Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.2043997Z 2025-05-07T20:33:37.2044199Z self = 2025-05-07T20:33:37.2044959Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.2045451Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8819b79580>} 2025-05-07T20:33:37.2046185Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.2046432Z context = 2025-05-07T20:33:37.2046436Z 2025-05-07T20:33:37.2046600Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.2046855Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.2046954Z module_map=module_map) 2025-05-07T20:33:37.2047118Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.2047212Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.2047289Z E ^ 2025-05-07T20:33:37.2047639Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.2047643Z 2025-05-07T20:33:37.2048052Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.2048060Z 2025-05-07T20:33:37.2048159Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2048375Z self=, 2025-05-07T20:33:37.2048455Z T=2048, 2025-05-07T20:33:37.2048525Z D=5120, 2025-05-07T20:33:37.2048600Z scale_ub=None, 2025-05-07T20:33:37.2048684Z contiguous=True, 2025-05-07T20:33:37.2048763Z compiled=False, 2025-05-07T20:33:37.2048830Z ) 2025-05-07T20:33:37.2049046Z self = 2025-05-07T20:33:37.2049211Z T = 2048, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:37.2049215Z 2025-05-07T20:33:37.2049286Z @given( 2025-05-07T20:33:37.2049402Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2049494Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2049602Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2049718Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2049823Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2049941Z ) 2025-05-07T20:33:37.2050179Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2050264Z def test_silu_mul_quant( 2025-05-07T20:33:37.2050340Z self, 2025-05-07T20:33:37.2050412Z T: int, 2025-05-07T20:33:37.2050480Z D: int, 2025-05-07T20:33:37.2050573Z scale_ub: Optional[float], 2025-05-07T20:33:37.2050696Z contiguous: bool, 2025-05-07T20:33:37.2050773Z compiled: bool, 2025-05-07T20:33:37.2050848Z ) -> None: 2025-05-07T20:33:37.2050935Z torch.manual_seed(2025) 2025-05-07T20:33:37.2051001Z 2025-05-07T20:33:37.2051163Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2051231Z 2025-05-07T20:33:37.2051321Z > x_sign = torch.sign(x) 2025-05-07T20:33:37.2053113Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
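Several of the OOMs above fire before the kernel ever runs, in the test's preprocessing: torch.sign, torch.abs, torch.clamp, and the final multiply each materialize another full [T, 2*D] buffer on top of the randn input (hence failures at lines 92, 94, and 95). An in-place variant would keep a single extra buffer; this is an illustration, not the test's code:

import torch

def clamp_signed_inplace(x: torch.Tensor) -> torch.Tensor:
    sign = torch.sign(x)  # the one remaining temporary
    # abs_/clamp_/mul_ mutate x in place instead of allocating new tensors.
    return x.abs_().clamp_(0.01, 2.0).mul_(sign)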
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:37.2053124Z 2025-05-07T20:33:37.2053239Z moe/activation_test.py:94: OutOfMemoryError 2025-05-07T20:33:37.2053244Z 2025-05-07T20:33:37.2053342Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2053557Z self=, 2025-05-07T20:33:37.2053634Z T=16384, 2025-05-07T20:33:37.2053705Z D=5120, 2025-05-07T20:33:37.2053818Z scale_ub=None, 2025-05-07T20:33:37.2053901Z contiguous=True, 2025-05-07T20:33:37.2053981Z compiled=False, 2025-05-07T20:33:37.2054051Z ) 2025-05-07T20:33:37.2054269Z self = 2025-05-07T20:33:37.2054440Z T = 16384, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:37.2054444Z 2025-05-07T20:33:37.2054522Z @given( 2025-05-07T20:33:37.2054635Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2054728Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2054849Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2054961Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2055070Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2055147Z ) 2025-05-07T20:33:37.2055386Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2055475Z def test_silu_mul_quant( 2025-05-07T20:33:37.2055549Z self, 2025-05-07T20:33:37.2055620Z T: int, 2025-05-07T20:33:37.2055691Z D: int, 2025-05-07T20:33:37.2055784Z scale_ub: Optional[float], 2025-05-07T20:33:37.2055868Z contiguous: bool, 2025-05-07T20:33:37.2055951Z compiled: bool, 2025-05-07T20:33:37.2056023Z ) -> None: 2025-05-07T20:33:37.2056109Z torch.manual_seed(2025) 2025-05-07T20:33:37.2056178Z 2025-05-07T20:33:37.2056335Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2058093Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 320.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
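The "free" and "total capacity" figures in these messages come from the CUDA driver and can be queried directly, alongside PyTorch's own allocator counters, when deciding up front whether an example can fit:

import torch

free_b, total_b = torch.cuda.mem_get_info()    # driver-level view, in bytes
print(f"free={free_b / 2**20:.2f} MiB of {total_b / 2**30:.2f} GiB")
print(torch.cuda.memory_allocated() / 2**30)   # GiB held by live tensors
print(torch.cuda.memory_reserved() / 2**30)    # GiB cached by the allocator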
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:37.2058101Z 2025-05-07T20:33:37.2058257Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:37.2058262Z 2025-05-07T20:33:37.2058363Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2058581Z self=, 2025-05-07T20:33:37.2058659Z T=4096, 2025-05-07T20:33:37.2058737Z D=5120, 2025-05-07T20:33:37.2058852Z scale_ub=None, 2025-05-07T20:33:37.2058930Z contiguous=True, 2025-05-07T20:33:37.2059011Z compiled=False, 2025-05-07T20:33:37.2059083Z ) 2025-05-07T20:33:37.2059291Z self = 2025-05-07T20:33:37.2059459Z T = 4096, D = 5120, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:37.2059463Z 2025-05-07T20:33:37.2059535Z @given( 2025-05-07T20:33:37.2059645Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2059741Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2059851Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2063305Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2063490Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2063561Z ) 2025-05-07T20:33:37.2063806Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2063893Z def test_silu_mul_quant( 2025-05-07T20:33:37.2063968Z self, 2025-05-07T20:33:37.2064039Z T: int, 2025-05-07T20:33:37.2064110Z D: int, 2025-05-07T20:33:37.2064202Z scale_ub: Optional[float], 2025-05-07T20:33:37.2064289Z contiguous: bool, 2025-05-07T20:33:37.2064367Z compiled: bool, 2025-05-07T20:33:37.2064441Z ) -> None: 2025-05-07T20:33:37.2064530Z torch.manual_seed(2025) 2025-05-07T20:33:37.2064596Z 2025-05-07T20:33:37.2064831Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2066586Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:37.2066595Z 2025-05-07T20:33:37.2066709Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:37.2066714Z 2025-05-07T20:33:37.2066811Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2067027Z self=, 2025-05-07T20:33:37.2067107Z T=2048, 2025-05-07T20:33:37.2067179Z D=5120, 2025-05-07T20:33:37.2067254Z scale_ub=None, 2025-05-07T20:33:37.2067340Z contiguous=False, 2025-05-07T20:33:37.2067491Z compiled=False, 2025-05-07T20:33:37.2067561Z ) 2025-05-07T20:33:37.2067777Z self = 2025-05-07T20:33:37.2067943Z T = 2048, D = 5120, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:37.2067948Z 2025-05-07T20:33:37.2068023Z @given( 2025-05-07T20:33:37.2068135Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2068227Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2068339Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2068447Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2068552Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2068628Z ) 2025-05-07T20:33:37.2068864Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2068962Z def test_silu_mul_quant( 2025-05-07T20:33:37.2069034Z self, 2025-05-07T20:33:37.2069103Z T: int, 2025-05-07T20:33:37.2069221Z D: int, 2025-05-07T20:33:37.2069322Z scale_ub: Optional[float], 2025-05-07T20:33:37.2069411Z contiguous: bool, 2025-05-07T20:33:37.2069498Z compiled: bool, 2025-05-07T20:33:37.2069575Z ) -> None: 2025-05-07T20:33:37.2069670Z torch.manual_seed(2025) 2025-05-07T20:33:37.2069746Z 2025-05-07T20:33:37.2069948Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2071697Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:37.2071705Z 2025-05-07T20:33:37.2071854Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:37.2071859Z 2025-05-07T20:33:37.2071956Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2072176Z self=, 2025-05-07T20:33:37.2072250Z T=4096, 2025-05-07T20:33:37.2072333Z D=7168, 2025-05-07T20:33:37.2072413Z scale_ub=None, 2025-05-07T20:33:37.2072492Z contiguous=True, 2025-05-07T20:33:37.2072574Z compiled=True, 2025-05-07T20:33:37.2072644Z ) 2025-05-07T20:33:37.2072852Z self = 2025-05-07T20:33:37.2073016Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:37.2073065Z 2025-05-07T20:33:37.2073139Z @given( 2025-05-07T20:33:37.2073249Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2073346Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2073453Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2073567Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2073672Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2073741Z ) 2025-05-07T20:33:37.2073979Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2074069Z def test_silu_mul_quant( 2025-05-07T20:33:37.2074142Z self, 2025-05-07T20:33:37.2074221Z T: int, 2025-05-07T20:33:37.2074293Z D: int, 2025-05-07T20:33:37.2074382Z scale_ub: Optional[float], 2025-05-07T20:33:37.2074471Z contiguous: bool, 2025-05-07T20:33:37.2074551Z compiled: bool, 2025-05-07T20:33:37.2074624Z ) -> None: 2025-05-07T20:33:37.2074716Z torch.manual_seed(2025) 2025-05-07T20:33:37.2074789Z 2025-05-07T20:33:37.2074953Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2076696Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:37.2076704Z 2025-05-07T20:33:37.2076816Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:37.2076820Z 2025-05-07T20:33:37.2076914Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2077127Z self=, 2025-05-07T20:33:37.2077204Z T=2048, 2025-05-07T20:33:37.2077273Z D=5120, 2025-05-07T20:33:37.2077347Z scale_ub=1200.0, 2025-05-07T20:33:37.2077474Z contiguous=False, 2025-05-07T20:33:37.2077554Z compiled=False, 2025-05-07T20:33:37.2077624Z ) 2025-05-07T20:33:37.2077832Z self = 2025-05-07T20:33:37.2077998Z T = 2048, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:37.2078003Z 2025-05-07T20:33:37.2078115Z @given( 2025-05-07T20:33:37.2078225Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2078317Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2078426Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2078535Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2078640Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2078710Z ) 2025-05-07T20:33:37.2078947Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2079037Z def test_silu_mul_quant( 2025-05-07T20:33:37.2079110Z self, 2025-05-07T20:33:37.2079183Z T: int, 2025-05-07T20:33:37.2079294Z D: int, 2025-05-07T20:33:37.2079389Z scale_ub: Optional[float], 2025-05-07T20:33:37.2079477Z contiguous: bool, 2025-05-07T20:33:37.2079561Z compiled: bool, 2025-05-07T20:33:37.2079634Z ) -> None: 2025-05-07T20:33:37.2079722Z torch.manual_seed(2025) 2025-05-07T20:33:37.2079797Z 2025-05-07T20:33:37.2079956Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2081698Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:37.2081743Z 2025-05-07T20:33:37.2081854Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:37.2081858Z 2025-05-07T20:33:37.2081953Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2082170Z self=, 2025-05-07T20:33:37.2082244Z T=4096, 2025-05-07T20:33:37.2082318Z D=7168, 2025-05-07T20:33:37.2082393Z scale_ub=1200.0, 2025-05-07T20:33:37.2082469Z contiguous=True, 2025-05-07T20:33:37.2082549Z compiled=False, 2025-05-07T20:33:37.2082618Z ) 2025-05-07T20:33:37.2082825Z self = 2025-05-07T20:33:37.2082993Z T = 4096, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:37.2083000Z 2025-05-07T20:33:37.2083073Z @given( 2025-05-07T20:33:37.2083188Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2083283Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2083393Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2083506Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2083613Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2083683Z ) 2025-05-07T20:33:37.2083923Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2084011Z def test_silu_mul_quant( 2025-05-07T20:33:37.2084081Z self, 2025-05-07T20:33:37.2084157Z T: int, 2025-05-07T20:33:37.2084229Z D: int, 2025-05-07T20:33:37.2084320Z scale_ub: Optional[float], 2025-05-07T20:33:37.2084406Z contiguous: bool, 2025-05-07T20:33:37.2084486Z compiled: bool, 2025-05-07T20:33:37.2084558Z ) -> None: 2025-05-07T20:33:37.2084648Z torch.manual_seed(2025) 2025-05-07T20:33:37.2084713Z 2025-05-07T20:33:37.2084919Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2086665Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:37.2086707Z 2025-05-07T20:33:37.2086818Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:37.2086822Z 2025-05-07T20:33:37.2086917Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2087132Z self=, 2025-05-07T20:33:37.2087206Z T=16384, 2025-05-07T20:33:37.2087277Z D=7168, 2025-05-07T20:33:37.2087352Z scale_ub=None, 2025-05-07T20:33:37.2087469Z contiguous=False, 2025-05-07T20:33:37.2087547Z compiled=True, 2025-05-07T20:33:37.2087612Z ) 2025-05-07T20:33:37.2087824Z self = 2025-05-07T20:33:37.2087992Z T = 16384, D = 7168, scale_ub = None, contiguous = False, compiled = True 2025-05-07T20:33:37.2088000Z 2025-05-07T20:33:37.2088071Z @given( 2025-05-07T20:33:37.2088179Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2088272Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2088383Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2088491Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2088638Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2088711Z ) 2025-05-07T20:33:37.2088947Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2089035Z def test_silu_mul_quant( 2025-05-07T20:33:37.2089108Z self, 2025-05-07T20:33:37.2089177Z T: int, 2025-05-07T20:33:37.2089250Z D: int, 2025-05-07T20:33:37.2089338Z scale_ub: Optional[float], 2025-05-07T20:33:37.2089418Z contiguous: bool, 2025-05-07T20:33:37.2089498Z compiled: bool, 2025-05-07T20:33:37.2089574Z ) -> None: 2025-05-07T20:33:37.2089659Z torch.manual_seed(2025) 2025-05-07T20:33:37.2089728Z 2025-05-07T20:33:37.2089885Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2091681Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:37.2091691Z 2025-05-07T20:33:37.2091799Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:37.2091803Z 2025-05-07T20:33:37.2091897Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2092113Z self=, 2025-05-07T20:33:37.2092182Z T=4096, 2025-05-07T20:33:37.2092252Z D=7168, 2025-05-07T20:33:37.2092328Z scale_ub=None, 2025-05-07T20:33:37.2092404Z contiguous=True, 2025-05-07T20:33:37.2092483Z compiled=False, 2025-05-07T20:33:37.2092548Z ) 2025-05-07T20:33:37.2092754Z self = 2025-05-07T20:33:37.2092919Z T = 4096, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:37.2092924Z 2025-05-07T20:33:37.2093037Z @given( 2025-05-07T20:33:37.2093149Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2093244Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2093348Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2093459Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2093566Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2093671Z ) 2025-05-07T20:33:37.2093909Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2093994Z def test_silu_mul_quant( 2025-05-07T20:33:37.2094061Z self, 2025-05-07T20:33:37.2094135Z T: int, 2025-05-07T20:33:37.2094202Z D: int, 2025-05-07T20:33:37.2094292Z scale_ub: Optional[float], 2025-05-07T20:33:37.2094380Z contiguous: bool, 2025-05-07T20:33:37.2094459Z compiled: bool, 2025-05-07T20:33:37.2094528Z ) -> None: 2025-05-07T20:33:37.2094616Z torch.manual_seed(2025) 2025-05-07T20:33:37.2094686Z 2025-05-07T20:33:37.2094912Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2096646Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:37.2096655Z 2025-05-07T20:33:37.2096766Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:37.2096810Z 2025-05-07T20:33:37.2096905Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2097121Z self=, 2025-05-07T20:33:37.2097194Z T=16384, 2025-05-07T20:33:37.2097270Z D=7168, 2025-05-07T20:33:37.2097344Z scale_ub=None, 2025-05-07T20:33:37.2097424Z contiguous=True, 2025-05-07T20:33:37.2097500Z compiled=False, 2025-05-07T20:33:37.2097565Z ) 2025-05-07T20:33:37.2097774Z self = 2025-05-07T20:33:37.2097942Z T = 16384, D = 7168, scale_ub = None, contiguous = True, compiled = False 2025-05-07T20:33:37.2097947Z 2025-05-07T20:33:37.2098016Z @given( 2025-05-07T20:33:37.2098126Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2098218Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2098327Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2098434Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2098542Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2098612Z ) 2025-05-07T20:33:37.2098849Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2098940Z def test_silu_mul_quant( 2025-05-07T20:33:37.2099012Z self, 2025-05-07T20:33:37.2099082Z T: int, 2025-05-07T20:33:37.2099154Z D: int, 2025-05-07T20:33:37.2099242Z scale_ub: Optional[float], 2025-05-07T20:33:37.2099323Z contiguous: bool, 2025-05-07T20:33:37.2099407Z compiled: bool, 2025-05-07T20:33:37.2099479Z ) -> None: 2025-05-07T20:33:37.2099567Z torch.manual_seed(2025) 2025-05-07T20:33:37.2099636Z 2025-05-07T20:33:37.2099794Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2101624Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:37.2101633Z 2025-05-07T20:33:37.2101745Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:37.2101787Z 2025-05-07T20:33:37.2101881Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2102097Z self=, 2025-05-07T20:33:37.2102166Z T=16384, 2025-05-07T20:33:37.2102240Z D=7168, 2025-05-07T20:33:37.2102317Z scale_ub=1200.0, 2025-05-07T20:33:37.2102394Z contiguous=True, 2025-05-07T20:33:37.2102474Z compiled=False, 2025-05-07T20:33:37.2102544Z ) 2025-05-07T20:33:37.2102751Z self = 2025-05-07T20:33:37.2102922Z T = 16384, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:37.2102927Z 2025-05-07T20:33:37.2103033Z @given( 2025-05-07T20:33:37.2103144Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2103238Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2103345Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2103455Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2103565Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2103632Z ) 2025-05-07T20:33:37.2103868Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2103954Z def test_silu_mul_quant( 2025-05-07T20:33:37.2104023Z self, 2025-05-07T20:33:37.2104095Z T: int, 2025-05-07T20:33:37.2104163Z D: int, 2025-05-07T20:33:37.2104293Z scale_ub: Optional[float], 2025-05-07T20:33:37.2104378Z contiguous: bool, 2025-05-07T20:33:37.2104457Z compiled: bool, 2025-05-07T20:33:37.2104528Z ) -> None: 2025-05-07T20:33:37.2104618Z torch.manual_seed(2025) 2025-05-07T20:33:37.2104686Z 2025-05-07T20:33:37.2104845Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2106580Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.73 GiB is allocated by PyTorch, and 19.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
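Each of the examples above fails at the same allocation site (moe/activation_test.py:92), and the requested sizes match the test input exactly: T x 2D bf16 elements, i.e. 4096 * 14336 * 2 bytes = 112 MiB and 16384 * 14336 * 2 bytes = 448 MiB. The requests fail not because they are large but because earlier examples have left roughly 22 GiB of the A10G's 22.07 GiB allocated, leaving only ~26 MiB free. Below is a minimal sketch of the mitigation the error text itself suggests, plus an explicit cache release that could run between Hypothesis examples; the helper name is hypothetical and not part of activation_test.py.

import gc
import os

# Must be set before the CUDA context is created (i.e. before the first
# CUDA call), per the PyTorch memory-management docs linked above.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch

def release_cached_gpu_memory() -> None:
    # Hypothetical teardown helper: drop dead Python references first,
    # then return the allocator's cached blocks to the driver so the next
    # Hypothesis example starts from a clean pool.
    gc.collect()
    torch.cuda.empty_cache()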
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:37.2106590Z 2025-05-07T20:33:37.2106702Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:37.2106706Z 2025-05-07T20:33:37.2106799Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2107016Z self=, 2025-05-07T20:33:37.2107089Z T=128, 2025-05-07T20:33:37.2107158Z D=5120, 2025-05-07T20:33:37.2107232Z scale_ub=1200.0, 2025-05-07T20:33:37.2107311Z contiguous=False, 2025-05-07T20:33:37.2107388Z compiled=False, 2025-05-07T20:33:37.2107499Z ) 2025-05-07T20:33:37.2107709Z self = 2025-05-07T20:33:37.2107872Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = False, compiled = False 2025-05-07T20:33:37.2107876Z 2025-05-07T20:33:37.2107947Z @given( 2025-05-07T20:33:37.2108056Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2108148Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2108260Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2108369Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2108517Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2108590Z ) 2025-05-07T20:33:37.2108826Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2108916Z def test_silu_mul_quant( 2025-05-07T20:33:37.2108987Z self, 2025-05-07T20:33:37.2109058Z T: int, 2025-05-07T20:33:37.2109131Z D: int, 2025-05-07T20:33:37.2109261Z scale_ub: Optional[float], 2025-05-07T20:33:37.2109344Z contiguous: bool, 2025-05-07T20:33:37.2109423Z compiled: bool, 2025-05-07T20:33:37.2109493Z ) -> None: 2025-05-07T20:33:37.2109578Z torch.manual_seed(2025) 2025-05-07T20:33:37.2109649Z 2025-05-07T20:33:37.2109808Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2109872Z 2025-05-07T20:33:37.2109960Z x_sign = torch.sign(x) 2025-05-07T20:33:37.2110077Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.2110159Z x = x_sign * x_clamp 2025-05-07T20:33:37.2110238Z x0 = x[:, :D] 2025-05-07T20:33:37.2110311Z x1 = x[:, D:] 2025-05-07T20:33:37.2110418Z 2025-05-07T20:33:37.2110495Z if contiguous: 2025-05-07T20:33:37.2110578Z x0 = x0.contiguous() 2025-05-07T20:33:37.2110661Z x1 = x1.contiguous() 2025-05-07T20:33:37.2110726Z 2025-05-07T20:33:37.2110807Z if scale_ub is not None: 2025-05-07T20:33:37.2110910Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.2111037Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.2111104Z ) 2025-05-07T20:33:37.2111176Z else: 2025-05-07T20:33:37.2111261Z scale_ub_tensor = None 2025-05-07T20:33:37.2111324Z 2025-05-07T20:33:37.2111448Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.2111574Z op = silu_mul_quant 2025-05-07T20:33:37.2111656Z if compiled: 2025-05-07T20:33:37.2111748Z op = torch.compile(op) 2025-05-07T20:33:37.2111849Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.2111915Z 2025-05-07T20:33:37.2112003Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.2112008Z 2025-05-07T20:33:37.2112096Z moe/activation_test.py:117: 2025-05-07T20:33:37.2112220Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.2112316Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.2112406Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.2112902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.2112990Z 
_fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.2113347Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.2113565Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.2113902Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.2113995Z kernel = self.compile( 2025-05-07T20:33:37.2114390Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.2114560Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.2114682Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.2114687Z 2025-05-07T20:33:37.2114880Z self = 2025-05-07T20:33:37.2115645Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.2116179Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f8819876e80>} 2025-05-07T20:33:37.2116919Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.2117101Z context = 2025-05-07T20:33:37.2117143Z 2025-05-07T20:33:37.2117299Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.2117566Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.2117668Z module_map=module_map) 2025-05-07T20:33:37.2117825Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.2117920Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.2117993Z E ^ 2025-05-07T20:33:37.2118344Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.2118386Z 2025-05-07T20:33:37.2118797Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.2118801Z 2025-05-07T20:33:37.2118899Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2119115Z self=, 2025-05-07T20:33:37.2119186Z T=2048, 2025-05-07T20:33:37.2119260Z D=7168, 2025-05-07T20:33:37.2119336Z scale_ub=None, 2025-05-07T20:33:37.2119417Z contiguous=False, 2025-05-07T20:33:37.2119495Z compiled=False, 2025-05-07T20:33:37.2119561Z ) 2025-05-07T20:33:37.2119770Z self = 2025-05-07T20:33:37.2119979Z T = 2048, D = 7168, scale_ub = None, contiguous = False, compiled = False 2025-05-07T20:33:37.2119984Z 2025-05-07T20:33:37.2120053Z @given( 2025-05-07T20:33:37.2120169Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2120263Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2120371Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2120482Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2120587Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2120658Z ) 2025-05-07T20:33:37.2120896Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2120981Z def test_silu_mul_quant( 2025-05-07T20:33:37.2121052Z self, 2025-05-07T20:33:37.2121126Z T: int, 2025-05-07T20:33:37.2121196Z D: int, 2025-05-07T20:33:37.2121285Z scale_ub: Optional[float], 2025-05-07T20:33:37.2121370Z contiguous: bool, 2025-05-07T20:33:37.2121450Z compiled: bool, 2025-05-07T20:33:37.2121523Z ) -> None: 2025-05-07T20:33:37.2121610Z torch.manual_seed(2025) 2025-05-07T20:33:37.2121682Z 2025-05-07T20:33:37.2121846Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2123589Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 26.44 MiB is free. Including non-PyTorch memory, this process has 22.04 GiB memory in use. Of the allocated memory 21.74 GiB is allocated by PyTorch, and 10.99 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
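The CompilationError interleaved with these OOMs (the ValueError above) is a distinct failure: this job runs on a g5.4xlarge, whose NVIDIA A10G reports compute capability (8, 6), and Triton rejects the fp8e4nv (e4m3) dtype used by _fbgemm_silu_mul_quant on that architecture, offering only fp8e4b15 and fp8e5. A minimal sketch of a capability guard that would skip the fp8 path on such GPUs follows; the >= (8, 9) threshold is an assumption inferred from the error, and the test-class name is hypothetical.

import unittest

import torch

def supports_fp8e4nv() -> bool:
    # Assumption: fp8e4nv (e4m3) needs compute capability 8.9 or newer;
    # the A10G in this job reports (8, 6) and trips the Triton error above.
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)

@unittest.skipIf(not supports_fp8e4nv(), "fp8e4nv unsupported on this GPU")
class SiluMulQuantFp8Example(unittest.TestCase):
    pass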
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:37.2123597Z 2025-05-07T20:33:37.2123709Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:37.2123715Z 2025-05-07T20:33:37.2123808Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2124021Z self=, 2025-05-07T20:33:37.2124137Z T=128, 2025-05-07T20:33:37.2124210Z D=7168, 2025-05-07T20:33:37.2124286Z scale_ub=1200.0, 2025-05-07T20:33:37.2124366Z contiguous=True, 2025-05-07T20:33:37.2124444Z compiled=True, 2025-05-07T20:33:37.2124513Z ) 2025-05-07T20:33:37.2124726Z self = 2025-05-07T20:33:37.2124950Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:37.2124955Z 2025-05-07T20:33:37.2125034Z @given( 2025-05-07T20:33:37.2125145Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2125237Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2125346Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2125455Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2125565Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2125640Z ) 2025-05-07T20:33:37.2125878Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2126005Z def test_silu_mul_quant( 2025-05-07T20:33:37.2126078Z self, 2025-05-07T20:33:37.2126150Z T: int, 2025-05-07T20:33:37.2126226Z D: int, 2025-05-07T20:33:37.2126316Z scale_ub: Optional[float], 2025-05-07T20:33:37.2126396Z contiguous: bool, 2025-05-07T20:33:37.2126477Z compiled: bool, 2025-05-07T20:33:37.2126547Z ) -> None: 2025-05-07T20:33:37.2126634Z torch.manual_seed(2025) 2025-05-07T20:33:37.2126708Z 2025-05-07T20:33:37.2126867Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2126938Z 2025-05-07T20:33:37.2127026Z x_sign = torch.sign(x) 2025-05-07T20:33:37.2127144Z x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.2127274Z x = x_sign * x_clamp 2025-05-07T20:33:37.2127347Z x0 = x[:, :D] 2025-05-07T20:33:37.2127420Z x1 = x[:, D:] 2025-05-07T20:33:37.2127491Z 2025-05-07T20:33:37.2127571Z if contiguous: 2025-05-07T20:33:37.2127658Z x0 = x0.contiguous() 2025-05-07T20:33:37.2127742Z x1 = x1.contiguous() 2025-05-07T20:33:37.2127807Z 2025-05-07T20:33:37.2127890Z if scale_ub is not None: 2025-05-07T20:33:37.2127996Z scale_ub_tensor = torch.tensor( 2025-05-07T20:33:37.2128125Z [scale_ub], device="cuda", dtype=torch.float32 2025-05-07T20:33:37.2128198Z ) 2025-05-07T20:33:37.2128273Z else: 2025-05-07T20:33:37.2128361Z scale_ub_tensor = None 2025-05-07T20:33:37.2128429Z 2025-05-07T20:33:37.2128555Z def fn() -> Tuple[torch.Tensor, torch.Tensor]: 2025-05-07T20:33:37.2128637Z op = silu_mul_quant 2025-05-07T20:33:37.2128720Z if compiled: 2025-05-07T20:33:37.2128818Z op = torch.compile(op) 2025-05-07T20:33:37.2128918Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.2128986Z 2025-05-07T20:33:37.2129074Z > y_fp8, y_scale = fn() 2025-05-07T20:33:37.2129078Z 2025-05-07T20:33:37.2129171Z moe/activation_test.py:117: 2025-05-07T20:33:37.2129297Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.2129391Z moe/activation_test.py:115: in fn 2025-05-07T20:33:37.2129484Z return op(x0, x1, scale_ub_tensor) 2025-05-07T20:33:37.2129856Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/torch/_dynamo/eval_frame.py:678: in _fn 2025-05-07T20:33:37.2129944Z return fn(*args, **kwargs) 
2025-05-07T20:33:37.2130446Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/fbgemm_gpu/experimental/gen_ai/moe/activation.py:80: in silu_mul_quant 2025-05-07T20:33:37.2130548Z _fbgemm_silu_mul_quant[grid]( 2025-05-07T20:33:37.2130922Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:330: in 2025-05-07T20:33:37.2131190Z return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs) 2025-05-07T20:33:37.2131529Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/jit.py:623: in run 2025-05-07T20:33:37.2131622Z kernel = self.compile( 2025-05-07T20:33:37.2132017Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:273: in compile 2025-05-07T20:33:37.2132224Z module = src.make_ir(options, codegen_fns, module_map, context) 2025-05-07T20:33:37.2132348Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 2025-05-07T20:33:37.2132353Z 2025-05-07T20:33:37.2132546Z self = 2025-05-07T20:33:37.2133311Z options = CUDAOptions(num_warps=4, num_ctas=1, num_stages=3, num_buffers_warp_spec=0, num_consumer_groups=0, reg_dec_producer=0,...site-packages/triton/backends/nvidia/lib/libdevice.10.bc'),), debug=False, backend_name='cuda', sanitize_overflow=True) 2025-05-07T20:33:37.2133843Z codegen_fns = {'convert_custom_types': , 'min_dot_size': . at 0x7f88197c7b00>} 2025-05-07T20:33:37.2134576Z module_map = {'triton.language.extra.libdevice': } 2025-05-07T20:33:37.2134764Z context = 2025-05-07T20:33:37.2134769Z 2025-05-07T20:33:37.2134923Z def make_ir(self, options, codegen_fns, module_map, context): 2025-05-07T20:33:37.2135183Z > return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns, 2025-05-07T20:33:37.2135283Z module_map=module_map) 2025-05-07T20:33:37.2135478Z E triton.compiler.errors.CompilationError: at 1:0: 2025-05-07T20:33:37.2135575Z E def _fbgemm_silu_mul_quant( 2025-05-07T20:33:37.2135646Z E ^ 2025-05-07T20:33:37.2135994Z E ValueError("type fp8e4nv not supported in this architecture. 
The supported fp8 dtypes are ('fp8e4b15', 'fp8e5')") 2025-05-07T20:33:37.2136001Z 2025-05-07T20:33:37.2136412Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/compiler/compiler.py:100: CompilationError 2025-05-07T20:33:37.2136417Z 2025-05-07T20:33:37.2136515Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2136734Z self=, 2025-05-07T20:33:37.2136806Z T=128, 2025-05-07T20:33:37.2136874Z D=7168, 2025-05-07T20:33:37.2136953Z scale_ub=1200.0, 2025-05-07T20:33:37.2137032Z contiguous=True, 2025-05-07T20:33:37.2137107Z compiled=False, 2025-05-07T20:33:37.2137175Z ) 2025-05-07T20:33:37.2137384Z self = 2025-05-07T20:33:37.2137552Z T = 128, D = 7168, scale_ub = 1200.0, contiguous = True, compiled = False 2025-05-07T20:33:37.2137560Z 2025-05-07T20:33:37.2137633Z @given( 2025-05-07T20:33:37.2137746Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2137842Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2137949Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2138059Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2138171Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2138240Z ) 2025-05-07T20:33:37.2138474Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2138563Z def test_silu_mul_quant( 2025-05-07T20:33:37.2138633Z self, 2025-05-07T20:33:37.2138710Z T: int, 2025-05-07T20:33:37.2138782Z D: int, 2025-05-07T20:33:37.2138873Z scale_ub: Optional[float], 2025-05-07T20:33:37.2138961Z contiguous: bool, 2025-05-07T20:33:37.2139041Z compiled: bool, 2025-05-07T20:33:37.2139115Z ) -> None: 2025-05-07T20:33:37.2139254Z torch.manual_seed(2025) 2025-05-07T20:33:37.2139324Z 2025-05-07T20:33:37.2139487Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2139557Z 2025-05-07T20:33:37.2139644Z x_sign = torch.sign(x) 2025-05-07T20:33:37.2139760Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.2141883Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 6.37 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:37.2141894Z 2025-05-07T20:33:37.2142010Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:37.2142018Z 2025-05-07T20:33:37.2142206Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2142427Z self=, 2025-05-07T20:33:37.2142504Z T=128, 2025-05-07T20:33:37.2142577Z D=5120, 2025-05-07T20:33:37.2142653Z scale_ub=1200.0, 2025-05-07T20:33:37.2142736Z contiguous=True, 2025-05-07T20:33:37.2142810Z compiled=True, 2025-05-07T20:33:37.2142883Z ) 2025-05-07T20:33:37.2143096Z self = 2025-05-07T20:33:37.2143254Z T = 128, D = 5120, scale_ub = 1200.0, contiguous = True, compiled = True 2025-05-07T20:33:37.2143258Z 2025-05-07T20:33:37.2143330Z @given( 2025-05-07T20:33:37.2143443Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2143603Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2143712Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2143823Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2143932Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2144003Z ) 2025-05-07T20:33:37.2144239Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2144329Z def test_silu_mul_quant( 2025-05-07T20:33:37.2144403Z self, 2025-05-07T20:33:37.2144476Z T: int, 2025-05-07T20:33:37.2144547Z D: int, 2025-05-07T20:33:37.2144640Z scale_ub: Optional[float], 2025-05-07T20:33:37.2144722Z contiguous: bool, 2025-05-07T20:33:37.2144804Z compiled: bool, 2025-05-07T20:33:37.2144878Z ) -> None: 2025-05-07T20:33:37.2144965Z torch.manual_seed(2025) 2025-05-07T20:33:37.2145035Z 2025-05-07T20:33:37.2145192Z x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2145264Z 2025-05-07T20:33:37.2145353Z x_sign = torch.sign(x) 2025-05-07T20:33:37.2145471Z > x_clamp = torch.clamp(torch.abs(x), 0.01, 2.0) 2025-05-07T20:33:37.2147216Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. 
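By this point the process is so close to exhaustion (4.44 MiB free of 22.07 GiB) that even the 20 MiB temporary for torch.clamp at moe/activation_test.py:95 fails, so every remaining example dies regardless of its parameters. One way to keep a single leaked allocation from poisoning the rest of the run would be to skip examples that cannot fit in the currently free memory; a minimal sketch using Hypothesis's assume() is below, where the helper and safety factor are hypothetical additions, not code from the test file.

import torch
from hypothesis import assume

def fits_in_free_gpu_memory(T: int, D: int, safety: float = 4.0) -> bool:
    # torch.randn([T, 2 * D], dtype=torch.bfloat16) needs T * 2D * 2 bytes;
    # the safety factor leaves headroom for the sign/clamp temporaries and
    # the quantized outputs. Sizing matches the 112/448 MiB requests above.
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    return T * 2 * D * 2 * safety <= free_bytes

# Inside test_silu_mul_quant, before the allocation:
#     assume(fits_in_free_gpu_memory(T, D))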
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:37.2147229Z 2025-05-07T20:33:37.2147339Z moe/activation_test.py:95: OutOfMemoryError 2025-05-07T20:33:37.2147344Z 2025-05-07T20:33:37.2147489Z Trying example: test_silu_mul_quant( 2025-05-07T20:33:37.2147711Z self=, 2025-05-07T20:33:37.2147783Z T=128, 2025-05-07T20:33:37.2147924Z D=7168, 2025-05-07T20:33:37.2148002Z scale_ub=None, 2025-05-07T20:33:37.2148082Z contiguous=True, 2025-05-07T20:33:37.2148160Z compiled=True, 2025-05-07T20:33:37.2148228Z ) 2025-05-07T20:33:37.2148436Z self = 2025-05-07T20:33:37.2148600Z T = 128, D = 7168, scale_ub = None, contiguous = True, compiled = True 2025-05-07T20:33:37.2148672Z 2025-05-07T20:33:37.2148744Z @given( 2025-05-07T20:33:37.2148853Z T=st.sampled_from([1, 128, 2048, 4096, 16384]), 2025-05-07T20:33:37.2148948Z D=st.sampled_from([5120, 7168]), 2025-05-07T20:33:37.2149056Z scale_ub=st.sampled_from([None, 1200.00]), 2025-05-07T20:33:37.2149168Z contiguous=st.sampled_from([True, False]), 2025-05-07T20:33:37.2149273Z compiled=st.sampled_from([True, False]), 2025-05-07T20:33:37.2149345Z ) 2025-05-07T20:33:37.2149581Z @settings(verbosity=Verbosity.verbose, max_examples=_MAX_SAMPLES, deadline=None) 2025-05-07T20:33:37.2149671Z def test_silu_mul_quant( 2025-05-07T20:33:37.2149744Z self, 2025-05-07T20:33:37.2149857Z T: int, 2025-05-07T20:33:37.2149925Z D: int, 2025-05-07T20:33:37.2150014Z scale_ub: Optional[float], 2025-05-07T20:33:37.2150100Z contiguous: bool, 2025-05-07T20:33:37.2150179Z compiled: bool, 2025-05-07T20:33:37.2150253Z ) -> None: 2025-05-07T20:33:37.2150351Z torch.manual_seed(2025) 2025-05-07T20:33:37.2150417Z 2025-05-07T20:33:37.2150575Z > x = torch.randn([T, 2 * D], device="cuda", dtype=torch.bfloat16) 2025-05-07T20:33:37.2152325Z E torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.07 GiB of which 4.44 MiB is free. Including non-PyTorch memory, this process has 22.06 GiB memory in use. Of the allocated memory 21.77 GiB is allocated by PyTorch, and 3.87 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) 2025-05-07T20:33:37.2152372Z 2025-05-07T20:33:37.2152485Z moe/activation_test.py:92: OutOfMemoryError 2025-05-07T20:33:37.2152614Z =============================== warnings summary =============================== 2025-05-07T20:33:37.2152919Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:37.2153215Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:37.2153505Z ../../../../../../../../miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108 2025-05-07T20:33:37.2154377Z /home/ec2-user/miniconda/envs/build_binary/lib/python3.13/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details. 2025-05-07T20:33:37.2154602Z warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. 
See " 2025-05-07T20:33:37.2154606Z 2025-05-07T20:33:37.2154808Z -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html 2025-05-07T20:33:37.2154971Z ================= 1 failed, 1 deselected, 3 warnings in 12.06s ================= 2025-05-07T20:33:38.9011878Z ERROR conda.cli.main_run:execute(125): `conda run python -m pytest -v -rsx -s -W ignore::pytest.PytestCollectionWarning --lf --last-failed-no-failures none ./moe/activation_test.py` failed. (See above for error) 2025-05-07T20:33:38.9643212Z [EXEC] [ATTEMPT 2/2] Command attempt failed. 2025-05-07T20:33:38.9643843Z 2025-05-07T20:33:38.9644306Z [EXEC] The command has failed after 2 + 1 attempts; aborting. 2025-05-07T20:33:38.9644926Z [TEST] Python test suite FAILED for some or all tests despite multiple retries: ./moe/activation_test.py 2025-05-07T20:33:38.9645319Z 2025-05-07T20:33:38.9645637Z 2025-05-07T20:33:38.9645642Z 2025-05-07T20:33:38.9662253Z ##[error]Process completed with exit code 1. 2025-05-07T20:33:38.9751657Z Post job cleanup. 2025-05-07T20:33:39.0718936Z [command]/usr/bin/git version 2025-05-07T20:33:39.0762264Z git version 2.47.1 2025-05-07T20:33:39.0799567Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/cfa129f4-ffae-4973-bfbb-710246d077a2/.gitconfig' 2025-05-07T20:33:39.0810408Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/cfa129f4-ffae-4973-bfbb-710246d077a2' before making global git config changes 2025-05-07T20:33:39.0811274Z Adding repository directory to the temporary git global config as a safe directory 2025-05-07T20:33:39.0816117Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/FBGEMM/FBGEMM 2025-05-07T20:33:39.0859692Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-05-07T20:33:39.0894256Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-05-07T20:33:39.1226814Z Entering 'external/asmjit' 2025-05-07T20:33:39.1294155Z Entering 'external/composable_kernel' 2025-05-07T20:33:39.1368099Z Entering 'external/cpuinfo' 2025-05-07T20:33:39.1435021Z Entering 'external/cutlass' 2025-05-07T20:33:39.1512755Z Entering 'external/googletest' 2025-05-07T20:33:39.1579823Z Entering 'external/hipify_torch' 2025-05-07T20:33:39.1646519Z Entering 'external/json' 2025-05-07T20:33:39.1732216Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-05-07T20:33:39.1757453Z http.https://github.com/.extraheader 2025-05-07T20:33:39.1769420Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader 2025-05-07T20:33:39.1800390Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :" 2025-05-07T20:33:39.2131695Z Entering 'external/asmjit' 2025-05-07T20:33:39.2173615Z http.https://github.com/.extraheader 2025-05-07T20:33:39.2216623Z Entering 'external/composable_kernel' 2025-05-07T20:33:39.2259759Z http.https://github.com/.extraheader 2025-05-07T20:33:39.2308808Z Entering 'external/cpuinfo' 2025-05-07T20:33:39.2352296Z http.https://github.com/.extraheader 2025-05-07T20:33:39.2395371Z Entering 'external/cutlass' 2025-05-07T20:33:39.2437781Z http.https://github.com/.extraheader 2025-05-07T20:33:39.2489360Z 
Entering 'external/googletest' 2025-05-07T20:33:39.2532282Z http.https://github.com/.extraheader 2025-05-07T20:33:39.2575349Z Entering 'external/hipify_torch' 2025-05-07T20:33:39.2616929Z http.https://github.com/.extraheader 2025-05-07T20:33:39.2659389Z Entering 'external/json' 2025-05-07T20:33:39.2701386Z http.https://github.com/.extraheader 2025-05-07T20:33:39.2855015Z A job completed hook has been configured by the self-hosted runner administrator 2025-05-07T20:33:39.2886832Z ##[group]Run '/home/ec2-user/runner-scripts/after_job.sh' 2025-05-07T20:33:39.2897126Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-05-07T20:33:39.2897486Z ##[endgroup] 2025-05-07T20:33:39.2997648Z [!ALERT!] Swap in detected! [!ALERT!] 2025-05-07T20:33:50.0948651Z [!ALERT!] Swap out detected [!ALERT!] 2025-05-07T20:34:06.6556237Z Cleaning up orphan processes
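The job therefore fails for two independent reasons that retries cannot fix: CUDA memory exhaustion accumulating across Hypothesis examples, and an fp8e4nv dtype the A10G cannot compile. A minimal sketch of reproducing the failing suite locally with the allocator hint applied before CUDA initialization is below; the pytest arguments are taken from the log, while driving it through pytest.main (rather than the conda run wrapper shown above) is an assumption made to keep the sketch in Python.

import os

# Apply the allocator hint from the OOM messages before torch initializes CUDA.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import pytest

raise SystemExit(
    pytest.main(
        ["-v", "-rsx", "-s",
         "-W", "ignore::pytest.PytestCollectionWarning",
         "./moe/activation_test.py"]
    )
)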